1 State-level Analysis of Pneumonia-related Hospital Readmissions

1.1 Make a base map

Let’s practice bringing up a very simple little map of the US from maps and ggplot2. The maps package provides latitude and longitude data for various regions of the world. The vignette for the package can be found here if you’d like more information on the package. Although we will do something much more complicated for our Project 2, for this little exploration let’s just keep it simple by taking advantage of everything that is pre-loaded in maps.

First, we need to extract state-level geographic information using the map_data() function. Similar functions exist for other geographic levels of data, including world (country), county, or other regions of the world, but we will be using state.

states <- map_data("state")
Table 1. The first 6 rows of the states dataframe
long lat group order region subregion
-87.46201 30.38968 1 1 alabama NA
-87.48493 30.37249 1 2 alabama NA
-87.52503 30.37249 1 3 alabama NA
-87.53076 30.33239 1 4 alabama NA
-87.57087 30.32665 1 5 alabama NA
-87.58806 30.32665 1 6 alabama NA

The 6 columns in the states dataframe are:

  1. long = longitude in unprojected geographic coordinates (i.e., longitude and latitude on a sphere or ellipsoid like WGS84); ranges from -180 to 180

  2. lat = latitude in unprojected geographic coordinates (i.e., longitude and latitude on a sphere or ellipsoid like WGS84); ranges from -90 to 90

  3. group = grouping variable for polygon boundaries where all points in a given group represent a single polygon, like parts of a country, state, or county. E.g., here Alabama is treated as a single group (labeled as 1 because it’s the first state alphabetically)

  4. order = indicates the order in which points should be connected to draw the polygon; helps when connecting long and lat coordinates to ensure the shape is drawn in the correct sequence

  5. region = represents the primary geographic unit requested; e.g., here we requested “state”, so the region column takes the state names as labels

  6. subregion = refers to a subdivision within a region, but is only present for detailed maps like “county” not “state”, hence we have no subregion here

Question 1A: [0.5 points]

Now, let’s make a base map leveraging the ggplot2 package using the states dataset. Notice that x will always be longitude, y will always be latitude, and that we must also pass it a grouping variable. Why do you think we need the group variable here?

The group variable defines how points should be connected to form shapes. Without it, ggplot will come up with a way to connect the points that will not make visual sense.
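For reference, a minimal sketch of what such a base map could look like (the fill and color choices here are my assumptions, not required styling):

```r
library(ggplot2)
library(maps)

## State polygon coordinates from the maps package
states <- map_data("state")

## x is longitude, y is latitude, and group keeps each polygon separate
ggplot(states, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", color = "black") +
  coord_quickmap()   # sensible aspect ratio for unprojected lat/long
```

Dropping `group = group` from the aesthetics is a quick way to see for yourself why the grouping variable matters.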

Great, we’ve made our first map of the continental US! It’s pretty uninteresting just as-is, so let’s use this map to overlay our pneumonia-related hospital readmissions data.

2 State-level Spatial Analysis

To perform spatial analysis, one needs geographic-level aggregate data that has been joined with geometry. Although it might feel like we technically have everything we need, alas, we do not, quite. Remember that our ultimate goal for our stakeholder is prediction, so we need to determine whether spatial structure is a problem before we proceed with predictions. If it is, we can include spatially lagged variables in our analysis; if not, we can proceed with exactly what we pre-processed in Project 1.

2.1 Spatial Analysis Pre-processing Steps

2.1.1 Identify your appropriate method of aggregation for each variable, but especially your outcome of interest!

But because our ultimate goal is to try to tie this back with all of the work we did in Project 1, I will actually give you data, which you get by running this code chunk. This returns both the training and testing sets, aggregated but not yet merged with geometries, so we don’t lose our other encoding and especially our transformations for OLS. They are going to be called stateAggTrain and stateAggTest, respectively.

source(file = "loadTrainTest.R")

NOTE 1: These new aggregated training and testing sets have region in them instead of state. This is so that they are already cleaned up for merging with geometries.

NOTE 2: All of the raw variables have been named Median_Raw... to distinguish them from their transformed and scaled counterparts!

Question 10A: [0.25 points]

Check that all of the states (in region) made it into the training set. If you choose to do this by counting, say with unique() or distinct(), you should get 52 if they are all present.

# Count the states
length(unique(stateAggTrain$region))
## [1] 52

We indeed have 52 regions in the training data.

Question 10B: [0.25 points]

Why am I not asking you to check the states in the testing data?

Because the test data is a smaller subset of the original data, it may not include all of the states.

Question 10C: [0.25 points]

In the aggregation that I did for both stateAggTrain and stateAggTest I took the median for everything. Why, when we just went through all the rigmarole of figuring out if we should display the mean or the median of variables like Predicted Pneumonia Readmissions?

HINT 1: What have we already done to the training and testing sets that we had not done here because we went back to the raw data?

HINT 2: Why is a median or a mean acceptable if we’re working with scaled and centered data?

Since we are now working with our transformed training and testing data, the distributions have been artificially changed. Data that has been scaled and centered will be roughly symmetrical, meaning that either the median or the mean is acceptable.
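A quick simulation illustrates why (purely hypothetical data, not our readmissions):

```r
## For roughly symmetric, scaled-and-centered data, mean and median agree
set.seed(42)                         # hypothetical example data
x <- as.vector(scale(rnorm(1000)))   # centered and scaled
mean(x)     # exactly 0 by construction of scale()
median(x)   # very close to 0 because the distribution is symmetric
```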

2.1.2 Identify your geographic level of aggregation for each variable.

Here, we have selected state. We already decided this going into the previous analysis; it only makes sense to continue that level of spatial analysis first.

2.1.3 Get the right geometries for spatial analysis.

Sadly, what we used for mapping will not be sufficient. Recall that, in lecture, I kept referring to something called shapefiles; I said we wouldn’t have to worry about them a ton (yet), but we were going to need them soon. Well, here they are!

A shapefile is a widely used geospatial vector data format for geographic information system (GIS) software, storing the location, shape, and attributes of geographic features such as points, lines, and polygons. This is typically spread across a set of related files with file extensions of .shp, .shx, or .dbf. Because shapefiles are the gold standard of spatial analysis, our spatial statistics packages in R similarly expect geometries in shapefile (often abbreviated as sf) format. We could either import shapefiles (which we will do in Project 2) or, here, we will simply convert the geometries from the maps package into shapefile format so we can calculate our spatial statistics.

Complicated, right? Thankfully, most of that magic will be performed for us using the sf package in R. If you want to learn more about the sf package, you can find that here.

We can convert to shapefile format in a single line of code:

## Get state map geometries in shapefile format
states_sf <- st_as_sf(map("state", plot = FALSE, fill = TRUE))

Question 10D: [0.25 points]

You may need to do a little digging, but what exactly is the st_as_sf() function from the sf package doing? Can you figure out why I didn’t just use the states object we already used for mapping?

# Pull up documentation
?st_as_sf

The st_as_sf() function converts data objects with spatial values into sf objects. In our case, the code takes the state data from the maps package and converts it into an sf object.

The sf format might be preferred, as it stores the spatial data in a single geom column. This could make data transformations, analyses, and visualizations more straightforward and flexible.

2.1.4 Calculate spatial weights for our target, predicted pneumonia readmission rate.

Now, we finally get to the meat of it! Spatial weights represent the spatial relationships (or influence) between geographic units (here, states). In other words, spatial weights define how much nearby observations contribute to calculations for a given location. These weights are typically based on proximity, contiguity (shared borders), or distance; the right choice depends on the type of spatial autocorrelation or dependence in the data.

Recall that, in lecture, I mentioned that we were going to look at spatial autocorrelation. Spatial autocorrelation is the relative degree to which the value of a variable (e.g., hospital readmissions) at one location is similar to values at nearby locations. When nearby areas have similar values (e.g., high readmission rates clustered together), it’s called positive spatial autocorrelation (clustering); when nearby areas have very different values, it’s negative spatial autocorrelation (dispersal). Thus, you can think of spatial weights as measures of the spatial correlation in hospital readmissions between states, in our case!
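To build a little intuition before we let spdep do the work, here is a toy, hand-rolled version of the clustering idea: four units in a row with a smooth gradient of values and assumed binary neighbor weights (this is an illustration only, not part of our pipeline):

```r
## Toy spatial autocorrelation check: four units in a line, values 1..4
x <- c(1, 2, 3, 4)
W <- matrix(0, 4, 4)
W[cbind(1:3, 2:4)] <- 1
W <- W + t(W)                    # symmetric binary weights: 1-2, 2-3, 3-4
z <- x - mean(x)                 # center the values
I <- (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
I  # 1/3: positive, because neighboring units carry similar values
```

Swapping to an alternating pattern like c(1, 4, 1, 4) drives \(I\) to -1, the dispersal extreme.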

We will use a package called spatial dependence (spdep, more information here) to calculate our spatial weights. We have a series of small sub-steps to go through to get there.

2.1.4.1 Turn off S2 geometry engine, if needed (usually a good idea if you didn’t import shapefiles directly or if they are older shapefiles)

By default, the sf package now uses Google’s S2 geometry engine (a library for geometric calculations on the sphere developed by Google for their maps), which is strict and throws errors when geometries are even slightly invalid (like if edges cross each other even slightly). We will choose to turn off the S2 engine just for this step because it will fail otherwise. This is often going to be the case when you are not importing shapefiles directly or if your shapefiles are older and predate the S2 engine. But, we will turn it back on when we’re done.

We can disable the S2 engine like this:

## Disable Google S2 engine
sf_use_s2(FALSE)

We will also just do a quick cleanup using the st_make_valid() function just to be safe. It never hurts! All it does is clean up any of those cases where the borders do cross (i.e., are invalid) so we don’t get any errors.

We can validate the geometry edges like this:

## Clean up the edges of `states_sf` to be safe
states_sf <- st_make_valid(states_sf)

Find Neighboring States

Next, we will use the poly2nb() function from the spdep package to create a neighborhood list based on the polygon boundaries of our spatial objects (states). It uses the shapefile (sf) object we already made, states_sf, to do this. What it does is define which polygons (states) are neighbors by looking for which ones share borders with each other (called contiguity). By default, it uses what is called “queen” contiguity (derived from chess); it defines as neighbors any states that share any point (edge OR corner).

As an example, Colorado and Arizona do not share a border, only a corner, right? But according to “queen” contiguity, they are still neighbors.

Question 11A: [0.5 points]

Look up “rook” contiguity and explain (1) what it means, (2) how the neighbor relationship of Colorado and Arizona would change, if it would, and (3) how we change to this type of contiguity in the poly2nb() function. (You don’t need to actually change it in the code, however.)

Rook contiguity means that states are only considered neighbors when they share a border.

Using rook contiguity, Colorado and Arizona would not be considered neighbors, as they do not share an edge.

In the poly2nb() function, you can change to rook contiguity by setting the queen parameter to FALSE.

We can find neighboring states using queen contiguity like this:

## Create neighbors using queen contiguity
states_nb <- poly2nb(states_sf, queen = TRUE)

Calculate Spatial Weights Between Neighbors

Next, we will use the nb2listw() function from spdep to convert the neighborhood list that we created with poly2nb() (called states_nb) into a spatial weights object that we can use in spatial analysis like Moran’s \(I\), LISA, and spatial regression. What nb2listw() does is assign weights to the neighbors identified in the nb object, and those weights define how strongly each neighbor influences a spatial unit (state).

There are actually several styles of weighting; we will be using row-standardizing (style = "W") as it is the most appropriate for spatial autocorrelation. Row-standardized weights are spatial weights where each row sums to 1. This means the influence of all neighbors on a given spatial unit is normalized, so the total influence is consistent across all units regardless of how many neighbors they have. Without standardization, units with more neighbors could have a greater total weight, biasing spatial statistics. This means that states like Colorado, which have lots of neighbors, are not weighted more heavily than states like Alaska or Hawaii, which are effectively islands. This will make Moran’s \(I\) more comparable across the states.
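Row-standardization itself is just arithmetic: divide each row of the binary contiguity matrix by its row sum. A toy sketch with a hypothetical three-unit matrix:

```r
## Hypothetical binary contiguity matrix: unit 1 borders units 2 and 3
B <- rbind(c(0, 1, 1),
           c(1, 0, 0),
           c(1, 0, 0))
Wm <- B / rowSums(B)   # style = "W": divide each row by its row sum
rowSums(Wm)            # every row now sums to 1
```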

We can calculate the spatial weights as follows:

## Calculate the state-level spatial weights
states_sw <- nb2listw(states_nb, style = "W")

## Turn Google S2 engine back on
sf_use_s2(TRUE)

Question 11B: [0.5 points]

The first state in our dataset is Alabama. Take a look at one of our maps if needed. How many neighbors do you expect it to have?

I would expect Alabama to have four neighbors (as it shares a border with four states).

Now investigate Alabama’s spatial weights:

states_sw$weights[[1]]
## [1] 0.25 0.25 0.25 0.25

Does this feel consistent (1) with what you know about the number of neighbors Alabama has and (2) what I mentioned above about row-standardizing our weights? Why or why not?

This feels consistent, as I had previously identified four neighbors. Since the weights are standardized, all the weights will add up to 1. Therefore, four weights all equal to 0.25 is exactly what we should expect.

Lastly, investigate the spatial weights for Massachusetts (number 20). Does it fit with your expectations given the number of neighbors?

# Massachusetts Weights

states_sw$weights[[20]]
## [1] 0.2 0.2 0.2 0.2 0.2

Once again, this meets expectations. There are five states bordering Massachusetts, and there are five weights of equal part that all add up to 1.

Join the Spatial Weights back to our Training Data

Our final step is to attach the neighbor lists and spatial weights back to the training data and the shapefile we already have. This way, we can proceed with the rest of our spatial analyses.

NOTE that if the shapefile spatial names (ID) did not already match region in stateAggTrain, we’d have to convert that first! Further, it is at this stage that you would add FIPS codes to facilitate merging, if needed.

We can perform the join as follows:

## If we want to make sure that the shapefile ID matches region:
## unique(states_sf$ID)

## Join the shapefile with spatial weights and neighbor lists back to the 
## training data
states_sf_AggTrain <- left_join(states_sf, stateAggTrain, 
                                by = c("ID" = "region"))

Question 12: [1 point]

I am sure you caught this, but spatial weights and everything are being done on the training data only. What key issue am I trying to avoid by performing all of my spatial pre-processing steps on the training data only? Why am I not using the full dataset for this?

HINT: It is the same reason we did all of our other pre-processing AFTER splitting the data!

We want to avoid data leakage, which could occur if we do our pre-processing on the full data.

2.2 Global Spatial Autocorrelation: Global Moran’s \(I\)

Recall from lecture that I mentioned that our goal is to determine if there is clustering (positive spatial autocorrelation) or dispersal/dispersion (negative spatial autocorrelation). Global Moran’s \(I\) is a way to measure clustering and will tell us whether clustering exists overall. Moran’s \(I\) statistic ranges from -1 to +1, just like a typical correlation coefficient! A significant positive \(I\) indicates clustering of similar values (high with high, low with low), whereas a significant negative \(I\) indicates dispersion (high values near low values; or “opposites attract”). Just as with a standard correlation, \(p\)-values < 0.05 suggest significant spatial autocorrelation (assuming we are using the 5% significance threshold). It’s important to note that an \(I\) = 0 would indicate randomness; thus, the null hypothesis (\(H_0\)) is that median pneumonia readmission rates are randomly distributed across the states.
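Formally, with \(n\) states, spatial weights \(w_{ij}\), and target values \(y_i\), the statistic being computed is the standard global Moran’s \(I\):

\[ I = \frac{n}{\sum_i \sum_j w_{ij}} \cdot \frac{\sum_i \sum_j w_{ij}\,(y_i - \bar{y})(y_j - \bar{y})}{\sum_i (y_i - \bar{y})^2} \]

With row-standardized weights, each row of \(w\) sums to 1, so \(\sum_i \sum_j w_{ij} = n\) and the leading fraction drops out.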

2.2.1 Set the target

We are going to set the target like this to make it easier to swap out at a later question. NOTE that we are centering but NOT scaling it. Why? Because we want it to be comparable to the lagged values; this will make a little more sense as we go.

## Change this to change the target!

y <- scale(states_sf_AggTrain$Median_RawPredictedReadmissionRate,
           center = TRUE, scale = FALSE)

2.2.2 Calculate the global Moran’s \(I\)

Using the moran.test() function from the spatial dependence (spdep) package, we will calculate the global Moran’s \(I\) statistic first. If you would like more details about this function, you can find that here.

## Calculate the global Moran's I for states on training data
## Use centered y & apply spatial weights
global_moran <- moran.test(y, states_sw)

## View the results
global_moran
## 
##  Moran I test under randomisation
## 
## data:  y  
## weights: states_sw    
## 
## Moran I statistic standard deviate = 2.9895, p-value = 0.001397
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.268531084      -0.020833333       0.009368833
## Show a plot of the top most outlier states
moran.plot(as.vector(y), 
           listw = states_sw,
           ylab = "Spatially-lagged Median Pneumonia Readmissions",
           xlab = "Median Pneumonia Readmissions (No Lag)", pch = 16,
           col = skittles[2], xlim = c(-3,3), 
           main = "Figure 3. Global Moran's I (State-Level Aggregation)")

Interpretation: Our global spatial autocorrelation test found a Moran’s \(I\) of 0.268 with a \(p\)-value of 0.0014, which is less than 0.05. Thus, we reject the \(H_0\) of a random distribution and conclude that we have significant positive spatial autocorrelation for median pneumonia-related readmissions! This suggests that states with similar median readmission rates tend to be clustered geographically in the U.S.!

The accompanying scatterplot produced with moran.plot() confirms this. On the \(x\)-axis we have median pneumonia-related readmission rates for each spatial unit (state) and on the \(y\)-axis the spatially lagged version of readmission rates. Spatial lags are the weighted average value of neighboring states (spatial units). Thus, this plot shows that high readmission states tend to cluster with other high states and low states with other low states (i.e., positive spatial autocorrelation). Outlier states in each of the quadrants of the graph are also identified.

  • High-high outliers (Quadrant I)
    • These states have especially high readmission rates and are generally surrounded by neighboring states that are also high.
  • Low-high outliers (Quadrant II)
    • These states have low readmission rates but are generally surrounded by neighboring states that are high.
  • Low-low outliers (Quadrant III)
    • These states have especially low readmission rates and are generally surrounded by neighboring states that are also low.
  • High-low outliers (Quadrant IV)
    • These states have high readmission rates but are generally surrounded by neighboring states that are low.
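The “spatially-lagged” axis is computed exactly as described: each state’s lag is the weighted average of its neighbors’ values, i.e., a matrix product of the row-standardized weights with the target. A toy sketch (hypothetical three-unit weights, not our states):

```r
## Spatial lag: weighted average of neighbors' values
Wm <- rbind(c(0, 0.5, 0.5),
            c(1,   0,   0),
            c(1,   0,   0))       # hypothetical row-standardized weights
x  <- c(10, 20, 30)
lag_x <- as.vector(Wm %*% x)
lag_x  # 25 10 10: unit 1 averages its two neighbors; units 2 and 3 see unit 1
```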

Question 13: [1 point]

Identify each of the outliers from the graph and interpret the type of outlier (high-high, low-high, etc.) that they are.

Delaware: Low-high outlier; this state has low readmission rates but is surrounded by states that have high readmission rates.

Utah: Low-low outlier; this state has low readmission rates and is surrounded by similar neighboring states.

West Virginia, DC, New Jersey: High-high outliers; these states have high readmission rates and are surrounded by similar states.

Question 14A: [0.5 points]

NOTICE that I had you define the target as a centered version of Median_RawPredictedReadmissionRate, which is the median of the untransformed, unscaled readmissions. Shouldn’t I have used the transformed and scaled version, bc_PredictedReadmissionRate? Can you think of any reason why I chose to use the centered median of the raw readmissions rather than the transformed version?

Centering simply shifts the data so that it is centered on zero. This helps when comparing to lagged y: with both centered, the quadrants tell us whether points fall above or below the mean. Unlike centering, scaling alters the spread of the data, which would make interpretation in the context of the original units more difficult. The Box-Cox transformation would also change the interpretation of the data, hindering our ability to compare it to lagged y.

HINT 1: What does centering do that scaling does not? Could centering be important, for example, to make sure that \(y\) is going in the same direction as lagged \(y\)?

HINT 2: Why do we need \(y\) and lagged \(y\) going in the same general directions? Could it have anything to do with identifying which quadrant they fall into?

HINT 3: Chapter 8 by Paul Moraga on Spatial Autocorrelation may be helpful, but it also gets a bit deep in the weeds too.

Question 14B: [0.5 points]

Now go back up to temporarily reset the target to bc_PredictedReadmissionRate here. There is no need to scale it again! Then, re-run the Global Moran’s \(I\). Make sure to update the limits of the \(x\)-axis or just comment that line out (xlim(...)) to look at the graph. Looking at the value of \(I\), the \(p\)-value, and the graph output, does the conclusion change? Does it matter which value we choose for the analysis?

The Moran I statistic is 0.289 with a p-value of 0.0007. Since the p-value is less than 0.05, we still reject the null hypothesis. The graph also appears to be more or less the same. Overall, it does not appear as if the choice makes a large difference.

NOTE: Make sure to change \(y\) back to the centered version of Median_RawPredictedReadmissionRate before moving to the next analysis!

2.3 Local Spatial Autocorrelation (LISA): Local Moran’s \(I\)

Now that we’ve gotten a sense for possible outlier states and we know that we have global spatial autocorrelation in median readmission rates, it would be nice to concretely identify hotspots and coldspots. Unlike the global Moran’s \(I\), which provides an overall measure of spatial autocorrelation across the entire nation, LISA (Local Indicators of Spatial Association) helps pinpoint specific states that significantly contribute to that global pattern. LISA identifies clusters of similar values, such as high-high (hotspots) and low-low (coldspots). It can also be used to identify spatial high-low and low-high outliers, just as the global test did. This local approach is powerful because it enables us to map and interpret where spatial clustering or dispersion is happening, which is especially valuable for targeted policy intervention or resource allocation. So, even though it may not be as vital for our predictions, it can help us better understand the spatial pattern of pneumonia-related hospital readmissions more generally.

2.3.1 Run local Moran’s I for LISA

To perform the LISA, we will use the localmoran() function from spdep. More information on the function can be found here. Afterwards, instead of printing the summary(), I show you the head() of the first rows produced by the LISA.

## Run local Moran's I for LISA on vectorized y & applying same spatial weights
local_moran <- localmoran(as.vector(y), states_sw)
Table 5. First 6 rows of Local Indicators of Spatial Association (LISA) output
Ii E.Ii Var.Ii Z.Ii Pr(z != E(Ii))
alabama -0.04 0.00 0.03 -0.26 0.79
arizona 0.11 0.00 0.03 0.68 0.50
arkansas 0.00 0.00 0.00 0.02 0.98
california -0.12 -0.04 0.54 -0.11 0.91
colorado 0.79 -0.01 0.06 3.36 0.00
connecticut 0.06 0.00 0.00 1.12 0.26
  • Ii = Stands for \(I_i\), which is the local Moran’s \(I\) estimate for that state. \(I_i\) is a measure of local spatial autocorrelation for the given state. If positive, the state is similar to its neighbors; if negative, it is dissimilar.

  • E.Ii = Stands for \(\mathbb{E}(I_i)\), which is the expected value of \(I_i\) under the null hypothesis of spatial randomness. This value is usually close to 0.

  • Var.Ii = Stands for \(Var(I_i)\), which is the variance of \(I_i\) under the null hypothesis. It is used to assess the significance of \(I_i\).

  • Z.Ii = Stands for \(Z_{I_i}\), which is the \(Z\)-score calculated as \(Z = \frac{I_i − \mathbb{E}(I_i)}{\sqrt{Var(I_i)}}\), which indicates how extreme \(I_i\) is compared to the random expectation. Used to derive statistical significance.

  • Pr(z!=E(Ii)) = Stands for \(Probability(Z_{I_i} \ne \mathbb{E}(I_i))\), which is the probability that \(I_i\) is not equal to the expected value \(\mathbb{E}(I_i)\). In other words, it’s the p-value associated with \(Z_{I_i}\)! So, it tells you whether the local spatial autocorrelation is statistically significant for any given state.
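As a sanity check on those definitions, we can recompute the \(Z\)-score for Colorado’s row of Table 5 from its other columns (the printed values are rounded to two decimals, so this only approximates the Z.Ii shown):

```r
## Colorado's row from Table 5, rounded values
Ii <- 0.79; E_Ii <- -0.01; Var_Ii <- 0.06
Z  <- (Ii - E_Ii) / sqrt(Var_Ii)   # Z = (Ii - E(Ii)) / sqrt(Var(Ii))
Z  # ~3.27, in line with the 3.36 the unrounded values produce
```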

That might feel like a lot of confusing math, but let’s start to put it together into something meaningful we can use: HOTSPOTS vs. COLDSPOTS.

2.3.2 Criteria for hotspots vs. coldspots

  • Hotspots (High-High). States with a high median readmission value surrounded by other high value states. As you get to the fringes of hotspots, other states that have high readmission rates but are surrounded by low states will not be labeled as part of the hotspot but could be labeled an outlier (high-low).

  • Coldspots (Low-Low). States with a low median readmission value surrounded by other low value states. As you get to the fringes of coldspots, other states that have low readmission rates but are surrounded by high states will not be labeled as part of the coldspot but could be labeled an outlier (low-high).

We can actually break this down a bit to help us better understand the difference between hot/coldspots and outliers - outliers are states whose readmission values are very different from those of their neighboring states.

 

Table 6. Description of the quadrants and their corresponding LISA cluster categories.
Readmissions Neighbors LISA Category Quadrant Explanation
High High Hotspot (H-H) I State with high readmissions is surrounded by other high states
Low Low Coldspot (L-L) III State with low readmissions is surrounded by other low states
High Low Outlier (H-L) IV State with high readmissions is surrounded by low rate states
Low High Outlier (L-H) II State with low readmissions is surrounded by high rate states
High or Low Random Not Sign. (N.S.) - No significant difference from random

This implies, then, that if we identify the quadrants we identify the hot/coldspots! So, let’s do that.

Question 15A: [1 point]

  • Hotspots (Quadrant I, “High-High”) will be states that have:

  • A significant \(p\)-value (\(p < 0.05\)), which suggests that local \(I_i\) is significantly positive (\(I_i > 0\))

  • A centered median hospital pneumonia-related readmissions rate that is positive

  • A spatially-lagged median hospital pneumonia-related readmissions rate that is positive

What will the criteria for coldspot (quadrant III) states be?

Coldspots (Quadrant III, “Low-Low”) will be states that have:

A significant p-value (p < 0.05), which suggests that the local Ii is significantly positive (Ii > 0), since low values surrounded by other low values are still similar to their neighbors

A centered median hospital pneumonia-related readmissions rate that is negative

A spatially-lagged median hospital pneumonia-related readmissions rate that is negative

2.3.2.1 Extract the spatially-lagged median hospital readmission rates

We can use the lag.listw() function to extract the spatial lag of the target. These values are the spatially-weighted averages of the median pneumonia-related readmissions of each state’s neighbors.

lag_medianReadmissions <- lag.listw(states_sw, y)
## Run this if you want to see what they look like!
## lag_medianReadmissions

## Just Massachusetts' value:
lag_medianReadmissions[20]
## [1] 0.2126469

Question 15B: [0.5 points]

The raw target is the % of all pneumonia patients discharged who are predicted to be readmitted within 30 days after being released from the hospital. So, e.g., the raw median pneumonia-related readmissions rate for Massachusetts is 16.6789, meaning that \(\approx\) 16.7% (no multiplication by 100 needed here!) are predicted to be readmitted based on a sliding-window average of 30-day readmissions.

Yet, the spatially-lagged pneumonia-readmissions for Massachusetts is only 0.2126469!

QUESTION: Why do the spatially-lagged variables seem SO shifted compared to the original value?

HINT: What did we do to the raw hospital readmissions before we calculated the spatial weights in this section?

The spatially-lagged variables seem shifted compared to the original values because we previously centered the raw data before calculating the spatial weights. Reversing the centering of the lagged values would bring them back toward the original scale; alternatively, we could center the raw values and compare on the centered scale.

Question 15C: [0.5 points]

Show just Massachusetts’ median readmissions to see its value after performing the step I am referring to in the hint.

HINT: Massachusetts is at row 20 in the states_sf_AggTrain dataframe, which will make it easier to locate its value in the vector created by looking at just the Median_RawPredictedReadmissionRate column.

# Get the mean used for centering
center_value <- attr(y, "scaled:center")

# Add the mean back to the centered data
y_original <- y + center_value


lag_medianReadmissions2 <- lag.listw(states_sw, y_original)
## Run this if you want to see what they look like!
## lag_medianReadmissions2

## Just Massachusetts' value:
lag_medianReadmissions2[20]
## [1] 16.2344
# Centering the Raw data
centered_values <- scale(states_sf_AggTrain$Median_RawPredictedReadmissionRate, center = TRUE, scale = FALSE)

# Show Massachusetts' centered value
centered_values[20]
## [1] 0.6571469

QUESTION: Is this a lot closer to the spatially-lagged value?

Yes, the values are more comparable.

2.3.2.2 Convert the LISA results to a dataframe

## as_tibble() preserves the column names, including `Pr(z != E(Ii))`
local_df <- local_moran %>% as_tibble()

2.3.2.3 Add the local \(I_i\), \(p\)-value, and lagged readmissions to the states_sf_AggTrain dataset

states_sf_AggTrain <- states_sf_AggTrain %>%
  mutate(
    ## adds the local I
    localI = local_df$Ii,
    ## adds the p-value
    pValue = local_df$`Pr(z != E(Ii))`,
    ## adds the lagged median readmissions
    lag_medianReadmissions = lag_medianReadmissions
  )

2.3.2.4 Classify the LISA quadrants (hot/coldspots) based on the criteria

The criteria we used (in case you already forgot!) were set here. All of the steps might make it easy to forget what we are trying to do, but we’re finally at the important part: identifying the quadrants.

Question 15D: [0.5 points]

The code below identifies the quadrants and adds them to the states_sf_AggTrain dataset. Your task is to add comments to the code below where indicated. Make sure to refer back to the criteria we set here if needed!

Answer in the comments.

# Update states_sf_AggTrain with a new quadrant column
states_sf_AggTrain <- states_sf_AggTrain %>%
  mutate(
    ## define values in quadrant by the following cases
    quadrant = case_when(
      # If the p-value is greater than 0.05, label the point as N.S.
      pValue > 0.05 ~ "N.S.",
      ## COMMENT HERE - Why do I not have to specify p-value again?
      # case_when() evaluates conditions in order, so every case below
      # this line only applies to rows where pValue <= 0.05.
      
      # A point is a hotspot when it falls into quadrant 1.
      # This is defined as y and lagged y both above 0.
      y > 0 & lag_medianReadmissions > 0 ~ "Hotspot",
      
      # A point is a coldspot when it falls into quadrant 3.
      # This is defined as y and lagged y both below 0.
      y < 0 & lag_medianReadmissions < 0 ~ "Coldspot",

      # A point is a high-low outlier when it falls into quadrant 2.
      # This is defined as y above 0 and lagged y below 0.
      y > 0 & lag_medianReadmissions < 0 ~ "High-Low Outlier",

      # A point is a low-high outlier when it falls into quadrant 4.
      # This is defined as y below 0 and lagged y above 0.
      y < 0 & lag_medianReadmissions > 0 ~ "Low-High Outlier",

      # Assign NA if data meets none of the described cases
      TRUE ~ NA
    )
  )
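The case_when() above is really just a four-quadrant lookup on the signs of the centered value and its spatial lag, gated by significance. Here is a minimal sketch of the same decision rule (written in Python purely to illustrate the logic; the function name and toy inputs are hypothetical, not from our dataset):

```python
def lisa_quadrant(y, lag_y, p_value, alpha=0.05):
    """Classify a LISA quadrant from a centered value and its spatial lag."""
    if p_value > alpha:
        return "N.S."              # not significant: no cluster label
    if y > 0 and lag_y > 0:
        return "Hotspot"           # quadrant 1: high value among high neighbors
    if y < 0 and lag_y < 0:
        return "Coldspot"          # quadrant 3: low value among low neighbors
    if y > 0 and lag_y < 0:
        return "High-Low Outlier"  # quadrant 2: high value among low neighbors
    if y < 0 and lag_y > 0:
        return "Low-High Outlier"  # quadrant 4: low value among high neighbors
    return None                    # e.g., y or its lag is exactly 0

print(lisa_quadrant(0.7, 0.4, 0.01))   # Hotspot
print(lisa_quadrant(-0.5, 0.6, 0.03))  # Low-High Outlier
print(lisa_quadrant(0.7, 0.4, 0.20))   # N.S.
```

The order of the checks is what lets us skip re-testing the p-value: once the first condition fails, everything after it is implicitly significant.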

2.3.2.5 Map the LISA cluster (quadrants)

Lastly, let’s visualize our hot/cold spots on the map! Note that we are now using geom_sf() for mapping rather than geom_polygon(). This is because our dataframe now contains sf (shapefile-style) geometries rather than the plain polygon coordinates we mapped earlier, hence the change.

ggplot(states_sf_AggTrain) +
  ## Now using geom_sf() because we have shapefile data rather than just
  ## polygons; geom_polygon() will no longer work here!
  geom_sf(aes(fill = quadrant), size = 0.2, 
          alpha = 1, color = "darkgray") +
  
  ## Fill based on the quadrants!
   scale_fill_manual(name = "LISA Cluster (Quadrant)",
    values = c("Hotspot" = skittles[6],
               "Coldspot" = skittles[1],
               "High-Low Outlier" = skittles[7],
               "Low-High Outlier" = skittles[2],
               "N.S." = "#dadbd7")) +
  
  ## Makes a completely blank map background
  theme(
    panel.grid = element_blank(),
    panel.background = element_rect(fill = "white"),
    plot.background = element_rect(fill = "white"),
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    axis.title = element_blank(),
    legend.position = "bottom"
  ) +
  labs(title = "LISA Clustering of Median Pneumonia-related Hospital Readmission Rates",
    subtitle = "Hotspots and coldspots show spatial clusters of similar values",
    caption = "Source: Centers for Medicare & Medicaid Services (CMS), 2024")

Question 15E: [0.5 points]

Interpret what it means if Indiana is a “low-high” outlier.

If Indiana is a “low-high” outlier, it means that Indiana itself has a significantly low pneumonia readmission rate while its neighboring states have high rates; in other words, its value is unusually low compared to its neighbors.

Question 15F: [1 point]

Do the hot and coldspots make sense with your observations of graph p1? Why or why not? Can you explain why hot/coldspots will not necessarily correspond with individual states’ high and low median readmission rates?

The coldspots appear to be roughly consistent with what is demonstrated in p1. Furthermore, many of the hotspots are shown in p1 to have decently high readmission rates. However, many of the very high rate states in p1 are labeled as N.S. on this new map. The difference between p1 and the hot/coldspot map is that while p1 is just looking at individual state performance, the hot/coldspot map is comparing states to their neighbors. Thus, it highlights how each state compares to the states near it.

2.4 Spatial Regression Models

Spatial regression models explicitly account for spatial relationships in data, which is something we encounter constantly in public health and epidemiology, as well as environmental studies (e.g., climate change, pollution, wildfire risk), agriculture (e.g., soil nutrients, water availability), sociology and social work (e.g., poverty, social justice), criminology (e.g., real-time crime predictions), and business (e.g., real-estate and housing prices). In other words, spatial regression gives us tools to analyze how nearby locations influence each other, especially when spatial relationships violate the assumption of independence in OLS regression!

There are two common issues that spatial regression addresses:

  1. Spatial dependence, in which outcomes in nearby locations tend to be similar (e.g., rates of a transmissable disease tend to be higher in nearby areas vs. farther areas).

  2. Spatial heterogeneity, in which relationships between variables may vary across space (e.g., housing prices, climate, or wildfire risk depend on geographic features of the landscape, sometimes features we have not even measured).

2.4.1 Spatial Lag/Autoregressive Regression (SAR)

There are three types of spatial regression models worth mentioning. The spatial lag model (aka, Spatial Autoregressive Model or SAR) is the one we will focus on first. In this regression model, the outcome depends on neighboring outcomes. This makes sense for us because we know from the LISA analysis that we have hot/coldspots of pneumonia-related readmissions!

The general structure of the SAR model is:

\(y = \rho Wy + \beta_1 X_1 + \beta_2 X_2 + ... \beta_n X_n +\epsilon\)

where \(y\) is the target variable (median pneumonia-related readmissions), \(W\) is the spatial weights matrix, which defines the neighborhoods (stored in states_sw!), and \(\rho\) is the spatial lag coefficient. This means that the first term of the equation, \(\rho Wy\), measures the spatial lag of the target variable as the average of neighbors’ outcomes multiplied by the spatial lag coefficient. It tells us how much the outcome deviates just due to spatial relationships with neighbors! The remaining terms, \(\beta_1 X_1 + \beta_2 X_2 + ... \beta_n X_n\), are just the standard regression coefficients for each of the \(n\) predictors, and \(\epsilon\) is an unmeasured error term, which we also have in OLS regression.
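To make the \(\rho Wy\) term concrete, here is a tiny numeric sketch (Python, purely illustrative) with three made-up regions and a row-standardized weights matrix, where each \((Wy)_i\) is just the average of region \(i\)'s neighbors' outcomes:

```python
# Three hypothetical regions: region 0 borders regions 1 and 2; regions 1
# and 2 each border only region 0. Row-standardization makes each row of W
# sum to 1, so (W y)[i] is the plain average of region i's neighbors.
W = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
y = [2.0, 4.0, 6.0]   # toy outcome values
rho = 0.33            # a made-up spatial lag coefficient

Wy = [sum(w * yj for w, yj in zip(row, y)) for row in W]  # neighbor averages
lag_term = [rho * v for v in Wy]  # the rho*W*y contribution to each outcome

print(Wy)  # [5.0, 2.0, 2.0]
```

So region 0, whose neighbors average 5.0, gets a larger spatial "boost" than regions 1 and 2; the rest of the model is ordinary regression on the predictors.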

We are justified in proceeding with a SAR because, as you hopefully remember from the end of Project 1, pneumonia-related hospital readmissions could be predicted with an OLS regression! However, we also know we have spatial patterns here that we want to try to account for in our predictive models, so let’s do that now.

Question 16: [1 point]

Your data have already been partitioned into training and testing sets, preprocessed and normalized, and we just calculated the spatial weights matrix, \(W\). What if we did NOT already have the data fully pre-processed? In a few sentences, describe the steps we would have to take if we were starting from the raw data.

HINT: Think of the data science lifecycle/flow we keep coming back to!

The following is an example of how we might have pre-processed the data.

First, knowing our target, we would drop the potential targets that we did not want to keep, as well as variables with no/little variance.

Second, we would split the data into our training and testing sets. We do this now to avoid data leakage later on.

Third, we would impute missing values, using a method like missForest or MICE. In addition, we would apply our frequency encoding.

Fourth, we would apply our transformation (box-cox), centering, and scaling.

Fifth, we would repeat the steps at the beginning of this demo, making sure that we properly aggregate our data at the state level, select the median data, and calculate our spatial weights.

2.4.1.1 For our convenience, make a dataframe that contains just the information we want for fitting the SAR and store it into sarTrain

            ## Start from the training data
sarTrain <- states_sf_AggTrain %>% 
  ## Make it a dataframe/tibble
  as_data_frame() %>% 
  ## Drop the features that we dropped before the analysis in Project 1, as 
  ## well as the features we do NOT want to use as predictors!
  select(-Median_RawPredictedReadmissionRate, 
         -Median_RawMedicareSpending, 
         -Median_RawHospitalReturnDays, 
         -Median_RawDeathRate, 
         -ID,
         -localI, 
         -pValue, 
         -quadrant, 
         -geom, 
         -lag_medianReadmissions)

2.4.1.3 Check residuals for spatial autocorrelation

If we didn’t already know that we have spatial relationships from the Moran’s \(I\) and the LISA, we could also consider calculating a Moran’s \(I\) on the residuals of the OLS regression. This would tell us, even if we had NOT yet done a Moran’s \(I\) or LISA, whether we should pay more attention to our spatial relationships and explicitly include them in our regression model.

\(H_0\): The null hypothesis is that the residuals are not spatially autocorrelated. Thus, the OLS is not misspecified and we do not need to worry about spatial relationships in our target.

## Moran's I test on residuals
moran.test(residuals(OLS), states_sw)
## 
##  Moran I test under randomisation
## 
## data:  residuals(OLS)  
## weights: states_sw    
## 
## Moran I statistic standard deviate = 1.5437, p-value = 0.06133
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.128852929      -0.020833333       0.009402604

Question 17B: [1.5 points]

  • What would we conclude from the results of the Moran’s \(I\) test on the residuals?

The Moran I statistic is 0.129, and the p-value is 0.06, which is greater than 0.05. We conclude that we fail to reject the null hypothesis. Our residuals do not appear to be spatially autocorrelated.

  • If we had started here, rather than calculating the Moran’s \(I\) and LISA first, could we have been potentially mistaken about spatial relationships? Why or why not?

Yes, because the model’s predictors can absorb much of the spatial autocorrelation, masking the underlying spatial structure of the data. Looking only at the residuals, we might have wrongly concluded that space didn’t matter.

  • Why might these results disagree with what we found from the Moran’s \(I\) and LISA on the scaled target?

The two analyses are focused on different goals. The first tests we did were looking to identify spatial clustering within the raw data. The test performed after modeling seeks to identify any spatial structure leftover after creating our model.

2.4.1.4 Fit the SAR model

The lagsarlm() from the spatialreg package will fit the SAR for us, allowing us to specify the spatial weights matrix, states_sw, that we calculated all the way back here.

## SAR (spatial lag model)
SAR <- lagsarlm(bc_PredictedReadmissionRate ~ ., data = sarTrain, 
                      listw = states_sw)

summary(SAR)
## 
## Call:lagsarlm(formula = bc_PredictedReadmissionRate ~ ., data = sarTrain, 
##     listw = states_sw)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.657380 -0.240268 -0.011297  0.205659  0.748405 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                                                                 Estimate
## (Intercept)                                                    0.0037668
## MRSA.Bacteremia                                               -0.1085416
## Abdomen.CT.Use.of.Contrast.Material                           -0.0323645
## Death.rate.for.pneumonia.patients                             -0.0659404
## Postoperative.respiratory.failure.rate                        -0.5704259
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate  0.9855298
## Composite.patient.safety                                      -0.3809309
## Medicare.spending.per.patient                                  0.4864549
## Healthcare.workers.given.influenza.vaccination                 0.0336501
## Left.before.being.seen                                         0.0319323
## Venous.Thromboembolism.Prophylaxis                            -0.1223090
## Doctor.communication                                           0.3979458
## Communication.about.medicines                                 -0.2554363
## Discharge.information                                         -0.2890223
## Cleanliness                                                   -0.2588597
## Quietness                                                     -0.3024060
## Hospital.return.days.for.pneumonia.patients                    0.0927036
##                                                               Std. Error
## (Intercept)                                                    0.0803023
## MRSA.Bacteremia                                                0.2512056
## Abdomen.CT.Use.of.Contrast.Material                            0.1629524
## Death.rate.for.pneumonia.patients                              0.2029840
## Postoperative.respiratory.failure.rate                         0.3537272
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate  0.2706225
## Composite.patient.safety                                       0.2174500
## Medicare.spending.per.patient                                  0.1476588
## Healthcare.workers.given.influenza.vaccination                 0.1160326
## Left.before.being.seen                                         0.1466825
## Venous.Thromboembolism.Prophylaxis                             0.2325981
## Doctor.communication                                           0.2050351
## Communication.about.medicines                                  0.2433874
## Discharge.information                                          0.1856011
## Cleanliness                                                    0.2002415
## Quietness                                                      0.1290785
## Hospital.return.days.for.pneumonia.patients                    0.1329591
##                                                               z value  Pr(>|z|)
## (Intercept)                                                    0.0469 0.9625866
## MRSA.Bacteremia                                               -0.4321 0.6656812
## Abdomen.CT.Use.of.Contrast.Material                           -0.1986 0.8425655
## Death.rate.for.pneumonia.patients                             -0.3249 0.7452908
## Postoperative.respiratory.failure.rate                        -1.6126 0.1068281
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate  3.6417 0.0002708
## Composite.patient.safety                                      -1.7518 0.0798066
## Medicare.spending.per.patient                                  3.2945 0.0009861
## Healthcare.workers.given.influenza.vaccination                 0.2900 0.7718120
## Left.before.being.seen                                         0.2177 0.8276656
## Venous.Thromboembolism.Prophylaxis                            -0.5258 0.5990004
## Doctor.communication                                           1.9409 0.0522745
## Communication.about.medicines                                 -1.0495 0.2939458
## Discharge.information                                         -1.5572 0.1194175
## Cleanliness                                                   -1.2927 0.1961020
## Quietness                                                     -2.3428 0.0191393
## Hospital.return.days.for.pneumonia.patients                    0.6972 0.4856566
## 
## Rho: 0.33291, LR test value: 4.7905, p-value: 0.028617
## Asymptotic standard error: 0.13262
##     z-value: 2.5102, p-value: 0.012065
## Wald statistic: 6.3013, p-value: 0.012065
## 
## Log likelihood: -12.75033 for lag model
## ML residual variance (sigma squared): 0.095777, (sigma: 0.30948)
## Number of observations: 49 
## Number of parameters estimated: 19 
## AIC: NA (not available for weighted model), (AIC for lm: 66.291)
## LM test for residual autocorrelation
## test value: 0.05318, p-value: 0.81762

2.4.1.5 Interpret the SAR model

The summary() output of the SAR model shows a lot going on. Let’s go through the three key pieces of evidence we want to assess from these results.

  1. Spatial Autoregressive Coefficient (\(\rho\), “Rho”)

Here, we find that \(\rho = 0.33291\) with an estimated p-value of \(p = 0.012065\) based on a Z-statistic of \(Z = 2.5102\). What the Wald Z-test is doing is testing whether \(\rho = 0\), just like in Pearson correlation tests!
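As a quick numeric check of that Wald test, using only the \(\rho\) and standard error reported in the summary() output (Python here just for the arithmetic; erfc() gives the two-sided tail probability of the standard normal):

```python
from math import erfc, sqrt

rho, se = 0.33291, 0.13262    # estimates from the summary() output above
z = rho / se                  # Wald Z statistic
p = erfc(abs(z) / sqrt(2))    # two-sided p-value under the standard normal

print(round(z, 2))  # 2.51
print(round(p, 3))  # 0.012
```

This reproduces the reported z-value and p-value (up to rounding), confirming that the test is nothing more than estimate-over-standard-error compared against a standard normal.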

CONCLUSION: With a p-value LESS than 0.05, we would conclude that \(\rho \ne 0\) - thus there is sufficient spatial dependency in the target variable to justify including the lag term in this model. That’s the opposite of what the Moran’s \(I\) on the residuals implied but feels very consistent with what we found for both global and local Moran’s \(I\) earlier.

  2. Likelihood Ratio Test

More generally, Likelihood Ratio Tests (LRTs) are used to compare how well a simpler, nested regression model fits relative to the fuller model it is nested within. If you’ve ever done stepwise regression, you’ve inherently done a Likelihood Ratio Test, possibly without knowing it! Note that OLS is nested within SAR: when there is NO spatial lag (\(\rho = 0\)), the first term of the SAR model cancels out and the SAR reduces to OLS!

The lagsarlm() function unfortunately will not automatically compare the Spatial Lag Model (SAR) with the Ordinary Least Squares (OLS) model - but we can, by separately performing a Likelihood Ratio Test (LRT). The LRT is very simple. The likelihood ratio (LR) of the two models can be found by:

\(LR = 2 \times (\log L_{SAR} - \log L_{OLS})\)

where \(\log L_{SAR}\) is the log-likelihood of the SAR model and \(\log L_{OLS}\) of the OLS model. The LR test statistic follows a \(\chi^2\) distribution with 1 degree of freedom.

But if that feels too mathy for you, just remember this: An LR of 0 means there is NO difference between the two models being compared, but the bigger the LR the higher the probability that the two models are different. Thus, our \(H_0\) (null hypothesis) is that there is no difference between the SAR model and the OLS. So, if our \(p\)-value is less than 0.05, we will reject that null hypothesis.

In the output, the \(\log L_{SAR} = -12.75\) but the \(\log L_{OLS}\) isn’t shown. We can very quickly perform the LRT in R using the anova() function. NOTE that this function is confusingly named here; it will either do an Analysis of Variance (ANOVA) or an Analysis of Deviance, also known as the Likelihood Ratio Test which is what we are doing here!

## Give it the larger model first, then the null model.
## SAR is our larger model; the null OLS is nested within the SAR because
## the SAR = OLS when rho = 0!
anova(SAR, OLS)
##     Model df    AIC  logLik Test L.Ratio  p-value
## SAR     1 19 63.501 -12.750    1                 
## OLS     2 18 66.291 -15.146    2  4.7905 0.028616

This tells us that the log-likelihood of SAR (\(\log L_{SAR}\)) is -12.75, the log-likelihood of OLS (\(\log L_{OLS}\)) is -15.146, and the likelihood ratio (LR) statistic is 4.7905 with an estimated \(p\)-value of 0.028616. Although it does not say this, the degrees of freedom is 1 here: 19 df from SAR - 18 df from OLS.

Thus, if we wanted to do it by “hand” (i.e., calculate it according to the formula), it would look like this:

## Likelihood ratio test by "hand"

## Extract the log-likelihoods for each model
logLik_OLS <- logLik(OLS)
logLik_SAR <- logLik(SAR)

## Calculate the likelihood ratio (LR) - make sure to use as.numeric()!
LR <- 2 * (as.numeric(logLik_SAR) - as.numeric(logLik_OLS))
## Find the estimated p-value for the LR statistic using the chi-squared 
## distribution. Use the df = 19-18 that we observed above. 
p <- pchisq(LR, df = 1, lower.tail = FALSE)

## Print the results!
cat("The likelihood ratio LR =", LR, "with 1 degree of freedom has p =", p)
## The likelihood ratio LR = 4.790534 with 1 degree of freedom has p = 0.02861655

Which is the same as what the anova() function gave us!

CONCLUSION: With a \(p\)-value LESS than 0.05, we would conclude that there is a difference between the two models. The more complex model, SAR, adds sufficient information to help its fit. Thus, there is sufficient evidence that a spatial model improves our predictive ability over what an OLS alone would. Again, that’s the opposite of what the Moran’s \(I\) on the residuals implied but feels very consistent with what we found for both global and local Moran’s \(I\) earlier.

  3. Lagrange Multiplier (LM) Test for Residual Spatial Autocorrelation

The LM test is based on the spatially lagged residuals and compares the residual spatial pattern to what would be expected under independence - meaning that it’s similar to the test we did earlier when we tested whether there was a spatial pattern to the OLS residuals! These are very similar tests, albeit calculated in slightly different ways.

The null hypothesis (\(H_0\)) again is that there is NO spatial pattern to the residuals; the difference now is that we are testing if there’s still “leftover” spatial autocorrelation in the residuals AFTER fitting our SAR model. In other words - did the SAR model do a sufficiently good job at dealing with spatial autocorrelation? If yes, yay! We’re done! If no, we’d need to do a Spatial Error Model (SEM) or Spatial Durbin Model (SDM), the other two types of regression I alluded to wayyyy back at the beginning of this section.

Our LM test statistic, per the summary() output, is 0.05318 with a \(p\)-value of 0.81762.

CONCLUSION: With a \(p\)-value MORE than 0.05, we could conclude that there is NO residual spatial autocorrelation after fitting the SAR. Thus, the SAR model is sufficient for explaining the spatial autocorrelation in the target!

Question 18: [3 points]

Briefly summarize everything we know at this point: (1) Global Moran’s \(I\) results, (2) Local Moran’s \(I\) (LISA) results, (3) Moran’s \(I\) on the residuals of the OLS model, (4) the test for spatial autoregression using the coefficient \(\rho\), (5) the Likelihood Ratio Test (LRT), and (6) the LM test for residual spatial autocorrelation after fitting the SAR model. DO NOT simply copy my interpretations but instead try to make sure you understand each of these pieces of evidence and what they tell us.

  1. From the Global Moran’s I results, we see that the data produces a Moran’s I of 0.268 and a p-value of 0.0014. The purpose of the Global Moran’s I is to test if our data shows spatial autocorrelation. Since our p-value is below 0.05, we reject the null hypothesis. We can conclude that our data has significant positive spatial autocorrelation.
  2. Our purpose of looking at LISA results is to see if we can identify where the spatial clusters are forming, and what they mean. From the results, we identified two groups. One was a coldspot cluster that consisted of Idaho, Wyoming, Nebraska, Utah, and Colorado. The other was a hotspot cluster that consisted of Ohio, Pennsylvania, Virginia, and Maryland. There was also one identified low-high outlier, being Indiana. This demonstrated an interesting cold/hotspot grouping towards the west and east sides of the country respectively.
  3. The Moran’s I for the residuals of the OLS model obtained a Moran’s I of 0.129 and a p-value of 0.06133. Since the p-value is greater than 0.05, we fail to reject the null hypothesis and cannot conclude that there is spatial autocorrelation in the model residuals.
  4. The test for spatial autoregression using the coefficient rho also looks at spatial dependency within the model. However, counter to our previous conclusion, our test obtained a p-value of 0.012065, suggesting that there is in fact spatial dependency in the model.
  5. The Likelihood Ratio Test is used to determine if a spatial model would be more effective than just an OLS model. Our LRT obtained a p-value below 0.05, suggesting that a spatial model does in fact improve our prediction ability when compared to an OLS model.
  6. The LM test for Residual Spatial Autocorrelation assesses if there is a spatial pattern within our residuals after fitting the SAR model. The test obtained a p-value greater than 0.05, telling us that there is no spatial pattern in the model residuals. This suggests that the SAR model properly accounts for the spatial autocorrelation in the target.

What is your final takeaway? Do we need to account for spatial autocorrelation in our regression model?

Yes, we do need to account for spatial autocorrelation in our regression model. The first two tests demonstrate the presence of spatial autocorrelation within the data, even pinpointing the location of spatial clusters. Then, we were able to demonstrate that the spatial model outperformed the OLS model and sufficiently accounted for the autocorrelation.

2.4.1.6 Inferential statistics - which predictors are important?

Table 7. Spatial Lag (SAR) Regression results.
Coefficient Beta1 SE p-value
rho 0.33* 0.133 0.012
(Intercept) 0.00 0.080 >0.9
MRSA.Bacteremia -0.11 0.251 0.7
Abdomen.CT.Use.of.Contrast.Material -0.03 0.163 0.8
Death.rate.for.pneumonia.patients -0.07 0.203 0.7
Postoperative.respiratory.failure.rate -0.57 0.354 0.11
Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 0.99*** 0.271 <0.001
Composite.patient.safety -0.38 0.217 0.080
Medicare.spending.per.patient 0.49*** 0.148 <0.001
Healthcare.workers.given.influenza.vaccination 0.03 0.116 0.8
Left.before.being.seen 0.03 0.147 0.8
Venous.Thromboembolism.Prophylaxis -0.12 0.233 0.6
Doctor.communication 0.40 0.205 0.052
Communication.about.medicines -0.26 0.243 0.3
Discharge.information -0.29 0.186 0.12
Cleanliness -0.26 0.200 0.2
Quietness -0.30* 0.129 0.019
Hospital.return.days.for.pneumonia.patients 0.09 0.133 0.5
Abbreviations: SE = Standard Error
1 *p<0.05; **p<0.01; ***p<0.001

Question 19: [2 points]

We can think of dealing with spatial spillover as helping us distinguish the wheat from the chaff (truth from noise). Interpret which of the coefficients are significant (\(p\)-value < 0.05) AND compare that to both Table 10 (parsimonious OLS model results) AND Figure 20 (Feature Importance from Elastic Net) from Project 1. Now that we are accounting for spatial autocorrelation, which predictors are still important? Did any become significant (important) now that weren’t before?

The following coefficients are significant, as they have a p-value < 0.05:

rho Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate Medicare.spending.per.patient Quietness

Comparing to what was found in Project 1, only Medicare spending and Quietness remain significant. In addition, Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate has gained significance.

2.4.1.7 Predictive statistics - how well does the model predict?

Recall that our goal for our stakeholder is to be able to predict(!) which hospitals perform better than others. They want to make a deployable tool for patients/customers. So, let’s test the predictive ability of our model by using our testing data to calculate the \(RMSE\).

  1. First, join the testing data with the shapefile, then create a test dataset suitable for SAR:

This is called “spatially embedding” our testing data. Notice that we are using the spatial weights from the training set to prevent data leakage!!

## Join the shapefile with spatial weights and neighbor lists back to the 
## testing data
states_sf_AggTest <- left_join(states_sf, stateAggTest, 
                                by = c("ID" = "region"))

            ## Start from the testing data
sarTest <- states_sf_AggTest %>% 
  ## Make it a dataframe/tibble
  as_data_frame() %>% 
  ## Move the ID column to the rownames
  column_to_rownames("ID") %>% 
  ## Drop the features that we dropped before the analysis in Project 1, as 
  ## well as the features we do NOT want to use as predictors!
  select(-Median_RawPredictedReadmissionRate, 
         -Median_RawMedicareSpending, 
         -Median_RawHospitalReturnDays, 
         -Median_RawDeathRate,
         -geom) %>% 
  ## Have to drop any missing values from the testing data
  drop_na()

  2. Now, perform the predictions using the testing data and the SAR model we already fit:

The spatialreg package has its own version of the predict function, which will take the SAR model, the testing data, and the spatial weights (states_sw).

## predict() returns a list-like object, so turn it into a dataframe for convenience
predictions <- predict(SAR, newdata = sarTest, listw = states_sw) %>% as_tibble()
Table 8. First 6 rows of SAR predictions using the testing data.
State fit trend signal
alabama -0.29 -0.36 0.07
arizona 0.31 0.49 -0.19
arkansas -0.32 -0.32 0.01
california 0.30 0.49 -0.19
colorado -0.75 -0.50 -0.25
connecticut 0.85 0.51 0.34

We can see that the predictions contain the following components:

  • fit: the predicted outcome \(\hat{y}\); here, this includes the influence of the spatial lag!
  • trend: the combination of linear predictor(s) for each value of the target \(y\); here, this is a combination of all \(\beta_1X_1 + ... \beta_nX_n\) in our model - so all of the predictors and their coefficients
  • signal: the spatial term, \(\rho Wy\); this is the influence that the spatial autocorrelation has on each value of the target \(y\)
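Because fit is just trend plus signal, we can sanity-check the decomposition against the rounded values printed in Table 8 (Python here, arithmetic only):

```python
# Rounded (fit, trend, signal) triples for three states from Table 8
rows = {
    "alabama":     (-0.29, -0.36,  0.07),
    "colorado":    (-0.75, -0.50, -0.25),
    "connecticut": ( 0.85,  0.51,  0.34),
}

for state, (fit, trend, signal) in rows.items():
    # fit should equal trend + signal, up to rounding in the printed table
    assert abs(fit - (trend + signal)) < 0.015, state

print("fit = trend + signal holds (within rounding)")
```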

  3. Calculate the RMSE:

We will use the fit component from the predictions to calculate RMSE.

## Calculate RMSE using the `fit` values
sarRMSE <- sqrt(mean((sarTest$bc_PredictedReadmissionRate - predictions$fit)^2))

  4. Compare that with the RMSE from the OLS:

To determine how much better a job it’s doing (if any), we need to compare the RMSE of the new OLS (not the one from Project 1) to the RMSE from the SAR model we just calculated.

olsPredictions <- predict(OLS, newdata = sarTest)
olsRMSE <- sqrt(mean((sarTest$bc_PredictedReadmissionRate - olsPredictions)^2))
Table 9. Comparison of RMSE for OLS and SAR models.
Model RMSE
OLS 1.33
SAR (Spatial Lag) 1.15

Question 20A: [0.5 points]

Why is it challenging to interpret this RMSE in terms of the original/raw median pneumonia-related hospital readmissions? (I know.. I know… you’ve gotten it by now. Just checking.)

This RMSE is telling us the model’s performance at predicting the transformed target, which differs from the raw median readmissions.

So, let’s add the Mean Absolute Error (MAE) as well, which may be easier to interpret.

## OLS MAE
olsMAE <- mean(abs(sarTest$bc_PredictedReadmissionRate - olsPredictions))

## SAR MAE
sarMAE <- mean(abs(sarTest$bc_PredictedReadmissionRate - predictions$fit))
Table 10. Comparison of RMSE and MAE for OLS and SAR models.
Model RMSE MAE
OLS 1.33 0.84
SAR (Spatial Lag) 1.15 0.79
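In case the two formulas themselves are fuzzy, here is a tiny sketch of RMSE vs. MAE on made-up residuals (Python, purely illustrative, not our model output):

```python
from math import sqrt

actual    = [1.0, 2.0, 3.0, 4.0]   # toy observed values
predicted = [1.5, 1.5, 3.5, 3.0]   # toy predictions
errors = [a - p for a, p in zip(actual, predicted)]

# RMSE squares errors first, so large misses are penalized more heavily
rmse = sqrt(sum(e ** 2 for e in errors) / len(errors))
# MAE averages absolute errors, so every unit of error counts equally
mae = sum(abs(e) for e in errors) / len(errors)

print(round(rmse, 4))  # 0.6614
print(round(mae, 4))   # 0.625
```

Notice RMSE exceeds MAE here because of the single 1.0-unit miss; the two only coincide when every error has the same magnitude.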

Question 20B: [1.5 points]

Interpret the MAE and the RMSE. Which model is better? Don’t forget to interpret the RMSE in terms of the transformed target!

The RMSE for the OLS model is 1.33, while the RMSE for the SAR is 1.15. This tells us that on average, the SAR model is closer to the actual values of the transformed target than the OLS model is.

The MAE of the OLS model is 0.84, while the MAE for the SAR is 0.79. This tells us that the absolute error is lower by 0.05 for the SAR model than the OLS model.

Overall, both measures suggest that the SAR model is better at predicting the transformed target than the OLS model.

2.4.2 Other Types of Spatial Regression Models

The results of our SAR showed that we did not need to worry about any other types of spatial regression because the LM Test for spatial autocorrelation of the residuals was not significant. Still, I want to take a brief moment to explain what those models are and how we do them, using our current data as an example. They are important enough alternatives to the SAR that we may need them in the future!

You would consider either of these models if your LM Test for residual spatial autocorrelation was significant.

2.4.2.1 Spatial Error Model

We’ve discussed how the SAR only models spatial “spillover” into the target, but we’ve also peeked at whether we have spatial spillover in the residuals (errors). Why would we care? One of the fundamental assumptions of an OLS regression is independence of our error terms, which is something we typically do not test for because we assume we have it based on our research/experiment/analytical design. But the moment we have spatial dependence, well, we may potentially violate this key assumption!

In our case, after controlling for all known predictors and modeling spatial dependence in the target only, we were able to uphold the assumption of independence (that’s what the LM test showed). But sometimes, even after SAR, our data may still show similar residual patterns. At such a point, the next step would be to fit a Spatial Error Model (SEM).

The general structure of the SEM is:

\(y = \lambda W \epsilon + \beta_1 X_1 + ... + \beta_n X_n + E\)

where \(y\) is the target variable (median pneumonia-related readmissions), \(W\) is the spatial weights matrix, which defines the neighborhoods (stored in states_sw!), and \(\lambda\) is the spatial error coefficient. This means that the first term of the equation, \(\lambda W \epsilon\), measures the spatial correlation between our errors. So, if \(\lambda = 0\), the spatial correlation between the errors isn’t enough to bias the standard errors. As with SAR, the remaining terms, \(\beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n\), are just the standard regression coefficients for each of the \(n\) predictors. \(E\) is an unmeasured error term, which is now the overall error leftover after fitting the SEM regression.

Fitting the model is pretty easy using the errorsarlm() function:

## Fit the SEM (spatial error model) using the training data and weights matrix
SEM <- errorsarlm(bc_PredictedReadmissionRate ~ ., data = sarTrain, 
                        listw = states_sw)

## View the results:
summary(SEM)
## 
## Call:errorsarlm(formula = bc_PredictedReadmissionRate ~ ., data = sarTrain, 
##     listw = states_sw)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.627137 -0.230742 -0.030906  0.212419  0.731746 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                                                                 Estimate
## (Intercept)                                                    0.0040570
## MRSA.Bacteremia                                               -0.1927576
## Abdomen.CT.Use.of.Contrast.Material                           -0.0875032
## Death.rate.for.pneumonia.patients                             -0.2111690
## Postoperative.respiratory.failure.rate                        -0.2147449
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate  0.7302127
## Composite.patient.safety                                      -0.3182179
## Medicare.spending.per.patient                                  0.4806595
## Healthcare.workers.given.influenza.vaccination                -0.0044689
## Left.before.being.seen                                         0.1297515
## Venous.Thromboembolism.Prophylaxis                            -0.2324866
## Doctor.communication                                           0.4387042
## Communication.about.medicines                                 -0.0940048
## Discharge.information                                         -0.5370302
## Cleanliness                                                   -0.1512913
## Quietness                                                     -0.2218475
## Hospital.return.days.for.pneumonia.patients                    0.0842956
##                                                               Std. Error
## (Intercept)                                                    0.1240457
## MRSA.Bacteremia                                                0.2275375
## Abdomen.CT.Use.of.Contrast.Material                            0.1472948
## Death.rate.for.pneumonia.patients                              0.2039855
## Postoperative.respiratory.failure.rate                         0.3005054
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate  0.2566630
## Composite.patient.safety                                       0.2052205
## Medicare.spending.per.patient                                  0.1430047
## Healthcare.workers.given.influenza.vaccination                 0.1202604
## Left.before.being.seen                                         0.1361761
## Venous.Thromboembolism.Prophylaxis                             0.2140493
## Doctor.communication                                           0.1937198
## Communication.about.medicines                                  0.2296558
## Discharge.information                                          0.1868993
## Cleanliness                                                    0.1993748
## Quietness                                                      0.1619164
## Hospital.return.days.for.pneumonia.patients                    0.1412999
##                                                               z value  Pr(>|z|)
## (Intercept)                                                    0.0327 0.9739093
## MRSA.Bacteremia                                               -0.8471 0.3969133
## Abdomen.CT.Use.of.Contrast.Material                           -0.5941 0.5524662
## Death.rate.for.pneumonia.patients                             -1.0352 0.3005683
## Postoperative.respiratory.failure.rate                        -0.7146 0.4748486
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate  2.8450 0.0044408
## Composite.patient.safety                                      -1.5506 0.1209940
## Medicare.spending.per.patient                                  3.3611 0.0007762
## Healthcare.workers.given.influenza.vaccination                -0.0372 0.9703570
## Left.before.being.seen                                         0.9528 0.3406804
## Venous.Thromboembolism.Prophylaxis                            -1.0861 0.2774190
## Doctor.communication                                           2.2646 0.0235353
## Communication.about.medicines                                 -0.4093 0.6822981
## Discharge.information                                         -2.8734 0.0040612
## Cleanliness                                                   -0.7588 0.4479551
## Quietness                                                     -1.3701 0.1706445
## Hospital.return.days.for.pneumonia.patients                    0.5966 0.5507929
## 
## Lambda: 0.61363, LR test value: 5.9309, p-value: 0.014877
## Asymptotic standard error: 0.12394
##     z-value: 4.9512, p-value: 0.00000073764
## Wald statistic: 24.514, p-value: 0.00000073764
## 
## Log likelihood: -12.18013 for error model
## ML residual variance (sigma squared): 0.086126, (sigma: 0.29347)
## Number of observations: 49 
## Number of parameters estimated: 19 
## AIC: 62.36, (AIC for lm: 66.291)

Question 21A: [1 point]

Test the null hypothesis that \(\lambda = 0\), i.e., that there is no spatial autocorrelation in the residuals. What p-value do you get and what conclusions do you make?

The LR test produces a p-value of 0.014877, which is below 0.05. Therefore, we reject the null hypothesis. There is spatial autocorrelation in the residuals.
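If you want to verify that LR test by hand, here is a quick sketch using only numbers reported in the summary above (assuming the OLS fit estimated 18 parameters, so its log-likelihood can be recovered from the reported AIC for lm of 66.291):

```r
## Log-likelihood of the SEM, from the summary output above
logLik_SEM <- -12.18013

## Recover the OLS log-likelihood from AIC = 2k - 2*logLik (k = 18 assumed)
logLik_OLS <- (2 * 18 - 66.291) / 2  # -15.1455

## Likelihood-ratio statistic and its chi-squared p-value (1 df, for lambda)
LR_stat <- 2 * (logLik_SEM - logLik_OLS)               # ~5.93
p_val   <- pchisq(LR_stat, df = 1, lower.tail = FALSE) # ~0.0149
```

Both values match the “LR test value: 5.9309, p-value: 0.014877” line in the summary, up to rounding.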

WHY might you find a significant SEM but the LM test from the SAR is NOT significant? What does that suggest about our SAR model? (This is not a trick question, we’ve already discussed it.)

The SEM tests whether there is spatial autocorrelation within the errors, which differs from the SAR, which looks at spatial spillover in the target.
Our results suggest that the assumption of independent error terms is violated.

2.4.2.2 Spatial Durbin Model

The Spatial Durbin Model (SDM) extends the Spatial Lag Model (SAR) by incorporating spatial lags of both the target variable AND the predictors. The idea here is that there may be spatial autocorrelation in not only the target but also some (if not all) of the predictors. For example, if there was a policy or some underlying demography that affected the pneumonia-related hospital readmissions of one state that ALSO influenced that of a neighboring state, and we had measured that as a predictor (e.g., poverty rates or the number of elderly in the population), then an SDM would be a good idea provided that you had a significant LM test.

The general structure of the SDM is:

\(y = \rho Wy + \{\beta_1 X_1 + ... +\beta_n X_n\} + \{W_1X_1\theta_1 + ... + W_nX_n\theta_n\} + \epsilon\)

where \(y\) is the target variable (median pneumonia-related readmissions), \(W\) is the spatial weights matrix, which defines the neighborhoods (stored in states_sw!), and \(\rho\) is the spatial lag coefficient, making the first term of the equation, \(\rho Wy\), the same as it was in SAR! The next term, \(\beta_1 X_1 + \beta_2 X_2 + ... \beta_n X_n\) are just the standard regression coefficients for each of the \(n\) predictors. Then comes the third term, \(\{W_1X_1\theta_1 + ... + W_nX_n\theta_n\}\), which represents EACH of the spatially-lagged predictors we are including. Note that we would not have to spatially lag all predictors. As before, \(\epsilon\) is an unmeasured error term.

So, just by looking at all that mathy-math, we can see this model is a lot more complicated! The advantage, however, is that it would allow us to account for spatial autocorrelation in the target and predictors simultaneously, which is immensely powerful. And I like power! (Statistical power, that is.)

Fitting the model is pretty easy using the lagsarlm() function:

SDM <- lagsarlm(formula = bc_PredictedReadmissionRate ~ ., data = sarTrain, 
  listw = states_sw, type = "mixed") ## "mixed" is necessary to specify as SDM

## View the results:
summary(SDM)
## 
## Call:lagsarlm(formula = bc_PredictedReadmissionRate ~ ., data = sarTrain, 
##     listw = states_sw, type = "mixed")
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.5119296 -0.0856881  0.0028116  0.1329395  0.4811692 
## 
## Type: mixed 
## Coefficients: (asymptotic standard errors) 
##                                                                    Estimate
## (Intercept)                                                       -0.753483
## MRSA.Bacteremia                                                    0.181542
## Abdomen.CT.Use.of.Contrast.Material                                0.044127
## Death.rate.for.pneumonia.patients                                  0.135721
## Postoperative.respiratory.failure.rate                            -1.430492
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate      1.596331
## Composite.patient.safety                                          -0.395633
## Medicare.spending.per.patient                                      0.841922
## Healthcare.workers.given.influenza.vaccination                     0.104029
## Left.before.being.seen                                            -0.025737
## Venous.Thromboembolism.Prophylaxis                                 0.399788
## Doctor.communication                                               0.335880
## Communication.about.medicines                                      0.391737
## Discharge.information                                             -0.598922
## Cleanliness                                                       -0.501820
## Quietness                                                         -0.353278
## Hospital.return.days.for.pneumonia.patients                        0.159403
## lag.MRSA.Bacteremia                                                1.556060
## lag.Abdomen.CT.Use.of.Contrast.Material                            0.234453
## lag.Death.rate.for.pneumonia.patients                              0.818625
## lag.Postoperative.respiratory.failure.rate                        -3.074017
## lag.Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate  2.572557
## lag.Composite.patient.safety                                       0.379481
## lag.Medicare.spending.per.patient                                  0.900775
## lag.Healthcare.workers.given.influenza.vaccination                 0.146529
## lag.Left.before.being.seen                                        -0.648498
## lag.Venous.Thromboembolism.Prophylaxis                             1.813592
## lag.Doctor.communication                                          -1.028270
## lag.Communication.about.medicines                                  0.476890
## lag.Discharge.information                                          1.418513
## lag.Cleanliness                                                   -0.607748
## lag.Quietness                                                     -0.821707
## lag.Hospital.return.days.for.pneumonia.patients                    0.518370
##                                                                   Std. Error
## (Intercept)                                                         0.135373
## MRSA.Bacteremia                                                     0.226457
## Abdomen.CT.Use.of.Contrast.Material                                 0.145975
## Death.rate.for.pneumonia.patients                                   0.183256
## Postoperative.respiratory.failure.rate                              0.396346
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate       0.227073
## Composite.patient.safety                                            0.194300
## Medicare.spending.per.patient                                       0.134421
## Healthcare.workers.given.influenza.vaccination                      0.132836
## Left.before.being.seen                                              0.131627
## Venous.Thromboembolism.Prophylaxis                                  0.325253
## Doctor.communication                                                0.219560
## Communication.about.medicines                                       0.224948
## Discharge.information                                               0.195077
## Cleanliness                                                         0.224143
## Quietness                                                           0.165374
## Hospital.return.days.for.pneumonia.patients                         0.126643
## lag.MRSA.Bacteremia                                                 0.459492
## lag.Abdomen.CT.Use.of.Contrast.Material                             0.332267
## lag.Death.rate.for.pneumonia.patients                               0.504535
## lag.Postoperative.respiratory.failure.rate                          0.793894
## lag.Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate   0.730970
## lag.Composite.patient.safety                                        0.642436
## lag.Medicare.spending.per.patient                                   0.363571
## lag.Healthcare.workers.given.influenza.vaccination                  0.275065
## lag.Left.before.being.seen                                          0.380962
## lag.Venous.Thromboembolism.Prophylaxis                              0.698519
## lag.Doctor.communication                                            0.439437
## lag.Communication.about.medicines                                   0.740221
## lag.Discharge.information                                           0.426730
## lag.Cleanliness                                                     0.600338
## lag.Quietness                                                       0.256628
## lag.Hospital.return.days.for.pneumonia.patients                     0.417160
##                                                                   z value
## (Intercept)                                                       -5.5660
## MRSA.Bacteremia                                                    0.8017
## Abdomen.CT.Use.of.Contrast.Material                                0.3023
## Death.rate.for.pneumonia.patients                                  0.7406
## Postoperative.respiratory.failure.rate                            -3.6092
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate      7.0300
## Composite.patient.safety                                          -2.0362
## Medicare.spending.per.patient                                      6.2633
## Healthcare.workers.given.influenza.vaccination                     0.7831
## Left.before.being.seen                                            -0.1955
## Venous.Thromboembolism.Prophylaxis                                 1.2292
## Doctor.communication                                               1.5298
## Communication.about.medicines                                      1.7415
## Discharge.information                                             -3.0702
## Cleanliness                                                       -2.2388
## Quietness                                                         -2.1362
## Hospital.return.days.for.pneumonia.patients                        1.2587
## lag.MRSA.Bacteremia                                                3.3865
## lag.Abdomen.CT.Use.of.Contrast.Material                            0.7056
## lag.Death.rate.for.pneumonia.patients                              1.6225
## lag.Postoperative.respiratory.failure.rate                        -3.8721
## lag.Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate  3.5194
## lag.Composite.patient.safety                                       0.5907
## lag.Medicare.spending.per.patient                                  2.4776
## lag.Healthcare.workers.given.influenza.vaccination                 0.5327
## lag.Left.before.being.seen                                        -1.7023
## lag.Venous.Thromboembolism.Prophylaxis                             2.5963
## lag.Doctor.communication                                          -2.3400
## lag.Communication.about.medicines                                  0.6443
## lag.Discharge.information                                          3.3241
## lag.Cleanliness                                                   -1.0123
## lag.Quietness                                                     -3.2019
## lag.Hospital.return.days.for.pneumonia.patients                    1.2426
##                                                                            Pr(>|z|)
## (Intercept)                                                       0.000000026066269
## MRSA.Bacteremia                                                           0.4227501
## Abdomen.CT.Use.of.Contrast.Material                                       0.7624312
## Death.rate.for.pneumonia.patients                                         0.4589309
## Postoperative.respiratory.failure.rate                                    0.0003071
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate     0.000000000002065
## Composite.patient.safety                                                  0.0417303
## Medicare.spending.per.patient                                     0.000000000376876
## Healthcare.workers.given.influenza.vaccination                            0.4335467
## Left.before.being.seen                                                    0.8449781
## Venous.Thromboembolism.Prophylaxis                                        0.2190116
## Doctor.communication                                                      0.1260692
## Communication.about.medicines                                             0.0816043
## Discharge.information                                                     0.0021393
## Cleanliness                                                               0.0251663
## Quietness                                                                 0.0326603
## Hospital.return.days.for.pneumonia.patients                               0.2081464
## lag.MRSA.Bacteremia                                                       0.0007080
## lag.Abdomen.CT.Use.of.Contrast.Material                                   0.4804273
## lag.Death.rate.for.pneumonia.patients                                     0.1046895
## lag.Postoperative.respiratory.failure.rate                                0.0001079
## lag.Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate         0.0004326
## lag.Composite.patient.safety                                              0.5547274
## lag.Medicare.spending.per.patient                                         0.0132278
## lag.Healthcare.workers.given.influenza.vaccination                        0.5942364
## lag.Left.before.being.seen                                                0.0887056
## lag.Venous.Thromboembolism.Prophylaxis                                    0.0094223
## lag.Doctor.communication                                                  0.0192852
## lag.Communication.about.medicines                                         0.5194105
## lag.Discharge.information                                                 0.0008869
## lag.Cleanliness                                                           0.3113745
## lag.Quietness                                                             0.0013651
## lag.Hospital.return.days.for.pneumonia.patients                           0.2140089
## 
## Rho: -0.3198, LR test value: 1.8085, p-value: 0.17868
## Asymptotic standard error: 0.18114
##     z-value: -1.7655, p-value: 0.077472
## Wald statistic: 3.1171, p-value: 0.077472
## 
## Log likelihood: 14.78912 for mixed model
## ML residual variance (sigma squared): 0.031295, (sigma: 0.1769)
## Number of observations: 49 
## Number of parameters estimated: 35 
## AIC: NA (not available for weighted model), (AIC for lm: 40.23)
## LM test for residual autocorrelation
## test value: 6.2317, p-value: 0.012548

Now, you might notice something a bit odd here. Now, the test of the \(H_0: \rho = 0\) is NOT significant (p = 0.17868) but the LM test IS significant (p = 0.012548), which is a complete reversal. What gives?!

The reality is that we likely have yet unaccounted for spatial structure in the data, in the form of OMITTED VARIABLES. Yikes! How can we deal with that if we have no idea what we omitted?!

TAKEAWAYS

  • The SAR did a “good enough” job at inference. We’ve muddied the waters a bit by doing the SDM. Sometimes “good enough” is totally okay! However, don’t forget that for prediction purposes, the SAR model is not going to work.

  • There are very likely to be omitted spatial variables. We haven’t, for example, included anything about underlying demographics or Social Determinants of Health.

  • If we were determined to improve this, we would either

      1. Try a Spatial Durbin Error Model (SDEM) or
      2. Try to find the omitted variables. We are going to go with option 2!

Question 21B: [1 point]

Briefly summarize a possible workflow, based on what we’ve done here as well as with help from other sources if needed, to use (1) Moran’s \(I\) and/or LISA, (2) SAR, (3) SEM, and (4) SDM. Table 11 (below) may help you, or you’re free to refer to it to save writing. Feel free to write this as a numbered or bulleted list, if you’d like!

  1. Use Moran’s I and/or LISA to test for global/local spatial autocorrelation within the target or residuals.
  2. Fit an OLS model to create an initial benchmark. Use Moran’s I/LISA to check for spatial autocorrelation and proceed if present.

Use the respective models under the listed conditions:

  1. If spatial dependence is in the target variable, fit a SAR model. Test for the significance of \(\rho\) to confirm the choice of model.
  2. If spatial dependence is in the residuals, fit a SEM model. Test for the significance of \(\lambda\) to confirm the choice of model.
  3. If spatial dependence is in both the target and predictors, fit a SDM model.
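As a hedged sketch of that decision step in code (not run here; it assumes an OLS fit named OLS on sarTrain and our states_sw weights), lm.LMtests() from spdep reports the Lagrange multiplier diagnostics that point to the right model:

```r
## Not run: requires the fitted objects from earlier sections
# OLS <- lm(bc_PredictedReadmissionRate ~ ., data = sarTrain)
# lm.LMtests(OLS, listw = states_sw,
#            test = c("LMlag", "LMerr", "RLMlag", "RLMerr"))
## A significant LMlag points toward a SAR; a significant LMerr toward
## a SEM; if both are significant, compare the robust versions
## (RLMlag/RLMerr) or consider an SDM.
```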

 

Table 11. Comparison of spatial regression models and when to use them.
Model Spatial Dependence In Use When
SAR (Spatial Lag/Autoregressive) Target variable Outcome in one area is influenced by neighboring outcomes
SEM (Spatial Error) Residuals \(\epsilon\) Unmeasured spatial effects or spatial error correlation
SDM (Spatial Durbin) Both target and predictors Capture spatial spillover of both outcomes and predictors

Question 22: [2 points]

DATA SCIENCE EXTENSIONS OF SPATIAL MODELING \(\rightarrow\) At this point, it should feel clear that we have spatial influence in our data with state-level differences in pneumonia-related hospital readmissions. Yet, building a robust predictive model is still not within our grasp. How could we get there?

Our next steps will involve moving in the direction of adding omitted spatially correlated features, but your first task is a thought experiment and a little research.

Do data scientists ever extract the spatially lagged targets/predictors and put them into machine learning models, like Elastic Net - or others like Random Forest or Neural Networks? Make sure to give an example from a scholarly or journalistic article.

HINT 1: Searching Google Scholar can be a starting point for academic articles.

HINT 2: The Harvard Data Science Review can be a solid starting point for articles that are more hybrid academic/journalistic.

HINT 3: Medium or LinkedIn can sometimes have decent how-to guides, but these articles are NOT validated in any way. If you choose an article from here, make sure you validate it with what we’ve done here and discussed in class!

Yes, it appears as if this practice is rather common in cases with spatial autocorrelation.

One example can be seen in Kämäräinen et al. (2023), “Spatiotemporal lagging of predictors improves machine learning estimates of atmosphere–forest CO2 exchange.” In this article, the researchers were looking to build models to predict carbon fluxes. They incorporated spatially lagged predictors, environmental factors like air temperature and soil moisture compared against neighboring sites, into random forest and gradient boosting models. The study demonstrated that incorporating such predictors increased the models’ accuracy.

From this, we can feel validated in our choice to include a spatially lagged predictor in our model for pneumonia readmissions.

Source:

Kämäräinen, M., Tuovinen, J.-P., Kulmala, M., Mammarella, I., Aalto, J., Vekuri, H., Lohila, A., and Lintunen, A.: Spatiotemporal lagging of predictors improves machine learning estimates of atmosphere–forest CO2 exchange, Biogeosciences, 20, 897–909, https://doi.org/10.5194/bg-20-897-2023, 2023.

Question 23: [2 points]

DATA SCIENCE EXTENSIONS OF SPATIAL MODELING \(\rightarrow\) Why is the state-level spatial dataset inappropriate for any further machine-learning? In other words, why are we currently restricted with the dataset exactly as it is to just the SAR, SEM, or SDM?

HINT: Check the dimensions of sarTrain if you need to!

dim(sarTrain)
## [1] 49 17

The training data are severely limited for the purposes of model building. 49 rows is a very small sample size for building predictive models. So few training examples will likely lead to models that are overfit, and therefore cannot generalize beyond those examples.

SAR, SEM, and SDM work well enough with this sample size, and are being used to test for spatial dependence. Other machine learning methods, however, will struggle with this sample size.
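To put a number on just how limited we are: the SDM summary above reported 35 estimated parameters from only 49 observations.

```r
## Observations and estimated parameters, from the SDM summary above
n_obs    <- 49
n_params <- 35

## Fewer than 1.5 observations per parameter; common rules of thumb
## for regression ask for roughly 10-20 observations per parameter
n_obs / n_params  # 1.4
```

With a ratio this low, any flexible machine learning model would essentially memorize the training states rather than learn generalizable structure.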

Question 24: [1 point]

DATA SCIENCE EXTENSIONS OF SPATIAL MODELING \(\rightarrow\) Is there another spatial structure to this dataset that we could explore for use with machine learning instead of state-level?

Working at the level of county, city, or zip code could provide for many more rows of data than the state-level aggregates.

3 Finding omitted spatially-autocorrelated variables

At this stage, we know that the SAR model is “good enough” - if we want to do inference only, that is! The SDM “hinted” at some likely omitted variables that are spatially correlated with the predictors in our dataset. The tricky thing is - how do we figure out what we’ve omitted?

Perhaps you noticed, as I did, that some of the higher rates of pneumonia-related readmissions tend to be in denser regions, like D.C. or New Jersey, or states with larger elderly populations, like Florida and California (see Table 4). Perhaps poverty rates are also important, as in West Virginia or Kentucky. The point is, we could speculate about possible demographic or socially-determined causes of poor health outcomes, all of which are likely to have their own underlying spatial structure as well!

Question 25: [1 point]

What is a Social Determinant of Health (SDOH)? Where does this phrase come from?

A Social Determinant of Health is an environmental condition that can affect the overall health of those who inhabit the environment. The study of SDOHs originated in the late 19th century as a response to the rise of illness in the Industrial Age. However, the specific phrase and study came into mainstream relevance in the early 2000s as WHO pushed to use SDOH as a way to understand and treat health inequities across communities.

Sources:

https://odphp.health.gov/healthypeople/priority-areas/social-determinants-health

https://www.momsmeals.com/health-plans-aaa-state-governments/a-brief-history-of-social-determinants-of-health/

3.1 Pre-processing population-level data to test SDOH hypotheses

Let’s bring in some data from the 2023 US Census (these are projections based on the 2020 Census), as well as from the Administration for Community Living (ACL), which had conveniently already compiled further data we could use from the US Census. The 2023 US Census Bureau projections were downloaded from here, and the ACL data were downloaded from here. For our convenience, I already put them into a csv we can import. The geographic measurements also come from the US Census Bureau, which I obtained here.

Note that, if our administrative data from the Centers for Medicare and Medicaid Services (CMS) that we’re using are from fiscal year 2024, then we would want to use the 2023 census data! This is because of how federal government fiscal years run.

## Read in, calculate the percent of the total adult population over 65,
## Make the states lowercase
## Calculate population density as hundreds of people per square mile

pop <- read_csv("2023_US_Census.csv", show_col_types = FALSE) %>% 
  mutate(over65PercentPop = over65Pop / adultPop,
         region = tolower(region),
         popDensity = (totalPop/100)/(landSqMi))      
Table 12. First 6 rows of US Census Bureau data.
region totalPop adultPop over65Pop percBelowPoverty landSqMi over65PercentPop popDensity
connecticut 3617176 2894190 663318 0.07 4842 0.23 7.47
district of columbia 678972 552380 87260 0.19 61 0.16 111.31
massachusetts 7001399 5659598 1260538 0.11 7800 0.22 8.98
new jersey 9290841 7280551 1611767 0.07 7354 0.22 12.63
puerto rico 3205691 2707012 756849 0.40 3424 0.28 9.36
rhode island 1095962 892124 206410 0.07 1034 0.23 10.60

Take a look at the original csv file. It included totalPop (includes children), adultPop (all individuals 16+ years), over65Pop (all adults 65+ years), percBelowPoverty (percent of the total population living at or below the Poverty Index for 2023), and landSqMi (the land area in square miles). I then calculated over65PercentPop as the percent of the total adult population that is over the age of 65. I also calculated popDensity, which is the population (in hundreds) per square mile of land. This gives us a better sense of how dense vs. spread out a geographic area may be. Note too that these data have already been aggregated to the state-level.
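As a quick arithmetic check of those derived columns, we can redo the calculations by hand for Connecticut’s row in Table 12:

```r
## Connecticut's raw values, copied from Table 12 above
totalPop  <- 3617176
adultPop  <- 2894190
over65Pop <- 663318
landSqMi  <- 4842

round(over65Pop / adultPop, 2)         # 0.23, matching over65PercentPop
round((totalPop / 100) / landSqMi, 2)  # 7.47, matching popDensity
```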

Lastly, let’s join the census data with the training and testing data (states_sf_AggTrain and states_sf_AggTest) so that we have those data downstream for analysis.

## Join with the states aggregated shapefile training data from Section 2 
states_sf_AggTrain <- left_join(states_sf_AggTrain, pop, 
                            by = c("ID" = "region"))

## And do testing data as well:
states_sf_AggTest <- left_join(states_sf_AggTest, pop, 
                            by = c("ID" = "region"))

3.2 Testing Hypotheses: Readmissions by Population Density

I didn’t formally lay these out as hypotheses (statements that explain WHY something happens) that we could test, but I sure could! Let’s start by formulating and testing a hypothesis about population density.

Formulation of Hypothesis 1: States with high population density have higher median pneumonia-related readmissions because denser populations support more hospital facilities, are likelier to reflect more urban areas, and may even enable faster spread of pneumonia as compared to states with less dense populations.

We have a hypothesis now, but how can we test it? (Hypotheses will always be a “because” statement!)

3.2.1 Heuristic (Visual) Tests

A heuristic is a solution that we arrive at by trial-and-error or, very often, by visual or some other kind of inspection. It’s different from statistical testing but often just as valid because it allows us to explore the data. Every time we do Exploratory Data Analysis we’re doing a heuristic assessment of our datasets!

So, let’s use our mapping skills to help us to assess whether we think there is any support for a correlation between population density and median pneumonia-related hospital readmissions. We will print four plots (including one we’ve previously made, p1) for analysis.

Question 26: [1 point]

Add your comments to the code that makes the three new plots: p2, p3, and p4.

Answer in the comments.

## Make a map `p2` of population density:
p2 <- states_sf_AggTrain %>% 
# Filter out DC, because it is a city, not a state, and will skew the data.
  filter(ID != "district of columbia") %>%  ## <- Comment this out to see why!
# Use ggplot to create a map
ggplot() +
  # Map state boundaries, fill based on population density
  geom_sf(aes(fill = popDensity), color = "grey20", alpha = 1, size = 0.2) +
  # Set a color gradient for the population density fill
  scale_fill_gradient(
    name = "Hundreds per sq. mi.",
    low = "white", high = skittles[3],
    na.value = "grey90"
  ) +
  # Set a title and caption
  labs(title = "Adult population density by state",
    caption = "Source: U.S. Census Bureau, 2023") +
  # Set the theme for the map
  theme(
    panel.grid = element_blank(),
    panel.background = element_rect(fill = "white"),
    plot.background = element_rect(fill = "white"),
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    axis.title = element_blank(),
    legend.position = "bottom"
  ) +
  # Define the coordinate system for the map
  coord_sf()

## Make a scatterplot called `p3` with best-fit reference line between 
## population density and the pneumonia-related hospital readmissions

p3 <- states_sf_AggTrain %>% 
# Filter out DC, for same reason as above
  filter(ID != "district of columbia") %>%  ## <- Comment this out to see why!
# Use ggplot to create a scatterplot between popDensity and the Median Raw target 
ggplot(aes(x = popDensity, y = Median_RawPredictedReadmissionRate)) +
# Add points for each state, add a regression line with a confidence interval
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm", se = T, color = skittles[3]) +
  theme_minimal() +
  # Set a title and axis labels
  labs(title = "Hospital readmissions by population density",
       x = "Hundreds per sq. mi.",
       y = "Predicted Pneumonia-related Hospital Readmissions")

## Make a barplot `p4` that shows the highest density states with their median
p4 <- states_sf_AggTrain %>% 
# Filter out DC, for same reason as above
  filter(ID != "district of columbia") %>%  ## <- Comment this out to see why!
  # Arrange data in descending order by population density
  arrange(desc(popDensity)) %>% 
  # Select the top 10 rows
  head(10) %>% 
  # Use ggplot to create a barplot
ggplot(aes(x = fct_reorder(ID, popDensity), y = Median_RawPredictedReadmissionRate)) +
  # Set fill color to skittles[3], set theme to minimal
  geom_col(fill = skittles[3]) +
  theme_minimal() + 
  # Make barplot horizontal by flipping axes
  coord_flip() +
  # Set a title, axis labels, and caption
  labs(title = "Top 10 Most Dense States",
     y = "Median % hospital readmissions",
     x = "",
     caption = "Source: CMS, 2024 & U.S. Census, 2023; n.b. Wash D.C. excluded")

Question 27A: [0.5 points]

Compare the new map, p2, of population density to the original map, p1. It may be hard to tell at first, but do you see a correlation between population density and median pneumonia-related readmissions?

In many cases, it appears as if the more dense states are the states with a higher median readmission rate. This is most notable in states like California, Massachusetts, and Florida. Likewise, Wyoming, Utah, and Iowa are examples of states with low population densities and low median readmission rates.

Question 27B: [0.5 points]

Now look at the scatterplot, p3, and the column graph, p4. What conclusions can you make, if any, from these two graphs and why?

Overall, it appears that states with highly dense populations are more likely to have high readmission rates. While the correlation is somewhat weak, we can clearly identify several key states driving this pattern.

3.2.2 Adding Population Density to the SAR

Now that we’ve added population density, we can recalculate the spatial statistics using the updated dataset. Let’s make a new sarTrain and sarTest so we can repeat the spatial regressions we did in Part 2.

3.2.2.1 Make the updated sarTrain and sarTest

## TRAINING
            ## Start from the training data
sarTrain <- states_sf_AggTrain %>% 
  ## Make it a tibble
  as_tibble() %>% 
  ## Drop the features that we dropped before the analysis in Project 1, as 
  ## well as the features we do NOT want to use as predictors!
  select(-Median_RawPredictedReadmissionRate, 
         -Median_RawMedicareSpending, 
         -Median_RawHospitalReturnDays, 
         -Median_RawDeathRate, 
         -ID,
         -localI, 
         -pValue, 
         -quadrant, 
         -geom, 
         -lag_medianReadmissions,
  ## Make sure to remove any correlated variables/variables we don't want to
  ## test from the Census data!!
         -totalPop,
         -adultPop,
         -over65Pop, 
         -percBelowPoverty,
         -landSqMi, 
         -over65PercentPop)

## TESTING
            ## Start from the testing data
sarTest <- states_sf_AggTest %>% 
  ## Make it a tibble
  as_tibble() %>% 
  ## Move the ID column to the rownames
  column_to_rownames("ID") %>% 
  ## Drop the features that we dropped before the analysis in Project 1, as 
  ## well as the features we do NOT want to use as predictors!
  select(-Median_RawPredictedReadmissionRate, 
         -Median_RawMedicareSpending, 
         -Median_RawHospitalReturnDays, 
         -Median_RawDeathRate,
         -geom,
         ## Make sure to remove any correlated variables/variables we don't want to
         ## test from the Census data!!
         -totalPop,
         -adultPop,
         -over65Pop, 
         -percBelowPoverty,
         -landSqMi, 
         -over65PercentPop) %>% 
  ## Have to drop any missing values from the testing data
  drop_na()

Question 28: [3 points]

We are about to walk through the spatial regressions again now that we have added popDensity to the dataset. You will need to answer an interpretation question after each code chunk!

3.2.2.2 Fit the OLS regression and test Moran’s I on the residuals

OLS <- lm(bc_PredictedReadmissionRate ~ ., data = sarTrain)
## Moran's I test on residuals
moran.test(residuals(OLS), states_sw)
## 
##  Moran I test under randomisation
## 
## data:  residuals(OLS)  
## weights: states_sw    
## 
## Moran I statistic standard deviate = 1.2912, p-value = 0.09832
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##        0.10453179       -0.02083333        0.00942720

QUESTION: What does Moran’s \(I\) test on the residuals tell us about spatial autocorrelation of the residual (error) terms?

The Moran’s I statistic is 0.10453179 and the p-value is 0.09832, which is greater than 0.05. Therefore, we cannot reject the null hypothesis. There does not appear to be spatial autocorrelation in the residuals.
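For reference, Moran's \(I\) on the residuals \(e_i\) takes the standard form, with \(w_{ij}\) the spatial weights and \(S_0 = \sum_{i}\sum_{j} w_{ij}\):

\[
I = \frac{n}{S_0} \cdot \frac{\sum_{i}\sum_{j} w_{ij}\, e_i e_j}{\sum_{i} e_i^2}
\]

Values near the expectation \(-1/(n-1)\) (here \(-1/48 \approx -0.0208\) with \(n = 49\), matching the printed Expectation) indicate no spatial autocorrelation.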

3.2.2.3 Fit the SAR and look at \(\rho\) and the LM test

## SAR (spatial lag model)
SAR <- lagsarlm(bc_PredictedReadmissionRate ~ ., data = sarTrain, 
                listw = states_sw)
## View the results:
summary(SAR)
## 
## Call:lagsarlm(formula = bc_PredictedReadmissionRate ~ ., data = sarTrain, 
##     listw = states_sw)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.5485252 -0.2280885  0.0078934  0.1806131  0.6826910 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                                                                 Estimate
## (Intercept)                                                   -0.0767219
## MRSA.Bacteremia                                               -0.0528671
## Abdomen.CT.Use.of.Contrast.Material                           -0.0067513
## Death.rate.for.pneumonia.patients                             -0.0389353
## Postoperative.respiratory.failure.rate                        -0.6229064
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate  0.8092097
## Composite.patient.safety                                      -0.3584346
## Medicare.spending.per.patient                                  0.5284643
## Healthcare.workers.given.influenza.vaccination                -0.0060959
## Left.before.being.seen                                        -0.0698625
## Venous.Thromboembolism.Prophylaxis                            -0.1266912
## Doctor.communication                                           0.2548785
## Communication.about.medicines                                 -0.1214764
## Discharge.information                                         -0.1837904
## Cleanliness                                                   -0.3603474
## Quietness                                                     -0.3206796
## Hospital.return.days.for.pneumonia.patients                    0.2304177
## popDensity                                                     0.0091564
##                                                               Std. Error
## (Intercept)                                                    0.0876569
## MRSA.Bacteremia                                                0.2411199
## Abdomen.CT.Use.of.Contrast.Material                            0.1567272
## Death.rate.for.pneumonia.patients                              0.1950750
## Postoperative.respiratory.failure.rate                         0.3407801
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate  0.2722906
## Composite.patient.safety                                       0.2085470
## Medicare.spending.per.patient                                  0.1431774
## Healthcare.workers.given.influenza.vaccination                 0.1132179
## Left.before.being.seen                                         0.1496479
## Venous.Thromboembolism.Prophylaxis                             0.2219348
## Doctor.communication                                           0.2078883
## Communication.about.medicines                                  0.2422115
## Discharge.information                                          0.1861147
## Cleanliness                                                    0.1987349
## Quietness                                                      0.1235582
## Hospital.return.days.for.pneumonia.patients                    0.1439392
## popDensity                                                     0.0044927
##                                                               z value  Pr(>|z|)
## (Intercept)                                                   -0.8753 0.3814368
## MRSA.Bacteremia                                               -0.2193 0.8264503
## Abdomen.CT.Use.of.Contrast.Material                           -0.0431 0.9656404
## Death.rate.for.pneumonia.patients                             -0.1996 0.8418002
## Postoperative.respiratory.failure.rate                        -1.8279 0.0675670
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate  2.9719 0.0029600
## Composite.patient.safety                                      -1.7187 0.0856647
## Medicare.spending.per.patient                                  3.6910 0.0002234
## Healthcare.workers.given.influenza.vaccination                -0.0538 0.9570607
## Left.before.being.seen                                        -0.4668 0.6406102
## Venous.Thromboembolism.Prophylaxis                            -0.5708 0.5681023
## Doctor.communication                                           1.2260 0.2201851
## Communication.about.medicines                                 -0.5015 0.6159979
## Discharge.information                                         -0.9875 0.3233919
## Cleanliness                                                   -1.8132 0.0698000
## Quietness                                                     -2.5954 0.0094488
## Hospital.return.days.for.pneumonia.patients                    1.6008 0.1094214
## popDensity                                                     2.0380 0.0415450
## 
## Rho: 0.35443, LR test value: 5.7081, p-value: 0.016887
## Asymptotic standard error: 0.13153
##     z-value: 2.6946, p-value: 0.0070471
## Wald statistic: 7.2609, p-value: 0.0070471
## 
## Log likelihood: -10.75671 for lag model
## ML residual variance (sigma squared): 0.087935, (sigma: 0.29654)
## Number of observations: 49 
## Number of parameters estimated: 20 
## AIC: NA (not available for weighted model), (AIC for lm: 65.221)
## LM test for residual autocorrelation
## test value: 0.16512, p-value: 0.68448

QUESTION: What is the value of \(\rho\) and its \(p\)-value? Does it suggest that we have spatial dependence in our target variable still? Are you surprised? What about the LM test?

The value of rho is 0.35443 and its p-value is 0.016887, which is less than 0.05. This suggests that there is spatial dependence in our target variable. This does not surprise me, as it is exactly what we had seen before in our earlier model. The LM test has a test value of 0.16512 and a p-value of 0.68448, which is greater than 0.05. This suggests that there is not spatial autocorrelation in the residuals. These two scores together suggest that a SAR model is more appropriate than a SEM model, as we see signs of spatial spillover in the target but no spatial correlation in the residuals.
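To make the SAR-vs-SEM distinction concrete, the two competing specifications can be written as (with \(W\) the spatial weights matrix):

\[
\text{SAR:}\quad y = \rho W y + X\beta + \varepsilon
\qquad\qquad
\text{SEM:}\quad y = X\beta + u,\quad u = \lambda W u + \varepsilon
\]

A significant \(\rho\) (spillover in the target itself) combined with a non-significant LM test on the residuals points to the lag (SAR) form rather than the error (SEM) form.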

3.2.2.4 Compare the SAR to the OLS model with a Likelihood Ratio Test (LRT)

## Give it the larger model first, then the null model.
## SAR is our larger model; the null OLS is nested within the SAR because
## the SAR = OLS when rho = 0!
anova(SAR, OLS)
##     Model df    AIC  logLik Test L.Ratio  p-value
## SAR     1 20 61.513 -10.757    1                 
## OLS     2 19 65.221 -13.611    2  5.7081 0.016887

QUESTION: Which model is the better fitting model and why?

The SAR model has a lower AIC and a higher log-likelihood than the OLS model, suggesting that it is a better fitting model.
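We can verify the likelihood ratio by hand from the log-likelihoods in the table:

\[
LR = 2\,(\ell_{SAR} - \ell_{OLS}) = 2\,\bigl(-10.757 - (-13.611)\bigr) \approx 5.708
\]

which matches the LR test value (5.7081, \(p = 0.016887\)) reported alongside \(\rho\) in the SAR summary; the difference in df is 1 (20 vs. 19 parameters) because \(\rho\) is the only added parameter.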

3.2.2.5 Inferential statistics

## Print a table using gtsummary
tbl_regression(SAR) %>% 
  modify_caption("Table 7. Spatial Lag (SAR) Regression results.") %>% 
  modify_header(label = "**Coefficient**") %>% 
  add_significance_stars(hide_p = FALSE)
Table 7. Spatial Lag (SAR) Regression results.
Coefficient Beta1 SE p-value
rho 0.35** 0.132 0.007
(Intercept) -0.08 0.088 0.4
MRSA.Bacteremia -0.05 0.241 0.8
Abdomen.CT.Use.of.Contrast.Material -0.01 0.157 >0.9
Death.rate.for.pneumonia.patients -0.04 0.195 0.8
Postoperative.respiratory.failure.rate -0.62 0.341 0.068
Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 0.81** 0.272 0.003
Composite.patient.safety -0.36 0.209 0.086
Medicare.spending.per.patient 0.53*** 0.143 <0.001
Healthcare.workers.given.influenza.vaccination -0.01 0.113 >0.9
Left.before.being.seen -0.07 0.150 0.6
Venous.Thromboembolism.Prophylaxis -0.13 0.222 0.6
Doctor.communication 0.25 0.208 0.2
Communication.about.medicines -0.12 0.242 0.6
Discharge.information -0.18 0.186 0.3
Cleanliness -0.36 0.199 0.070
Quietness -0.32** 0.124 0.009
Hospital.return.days.for.pneumonia.patients 0.23 0.144 0.11
popDensity 0.01* 0.004 0.042
Abbreviations: CI = Confidence Interval, SE = Standard Error
1 *p<0.05; **p<0.01; ***p<0.001

QUESTION: Did any of the predictors change significance from the first SAR we ran? What about popDensity? Is it significant, and how do you interpret its beta coefficient (the slope) AFTER adjusting for spatial dependence in readmission rate?

The significant predictors in this model are:

  • Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate
  • Medicare.spending.per.patient
  • Quietness
  • popDensity

This lines up with our first SAR, with the addition of popDensity.

The beta coefficient tells us that a one unit increase in population density is associated with a 0.01 unit increase in readmission rate (on the scale of our transformed data), holding the other predictors fixed.
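One caveat worth keeping in mind (a general property of spatial lag models, not something specific to this output): because the SAR includes the \(\rho W y\) term, a coefficient is not the full marginal effect. A change in a predictor in one state feeds back through its neighbors via the spatial multiplier:

\[
\frac{\partial y}{\partial x_k} = (I - \rho W)^{-1} \beta_k
\]

so the 0.01 is best read as the direct association before spatial feedback; the impacts() function in spatialreg can decompose the total effect into direct and indirect (spillover) pieces.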

3.2.2.6 Fit SEM and SDM

## Fit the SEM (spatial error model) using the training data and weights matrix
SEM <- errorsarlm(bc_PredictedReadmissionRate ~ ., data = sarTrain, 
                  listw = states_sw)
## Uncomment to view results!
## summary(SEM)

## Fit the SDM (spatial Durbin model) using the training data and weights matrix
SDM <- lagsarlm(formula = bc_PredictedReadmissionRate ~ ., data = sarTrain, 
    listw = states_sw, type = "mixed") ## "mixed" is necessary to specify as SDM
## Uncomment to view results!
## summary(SDM)

3.2.2.7 Predictions using the testing data on SAR, SEM, and SDM

## Fit the predictions for the FOUR models using the testing data!
olsPredictions <- predict(OLS, newdata = sarTest)

sarPredictions <- predict(SAR, newdata = sarTest, listw = states_sw) %>% 
  as_tibble()

semPredictions <- predict(SEM, newdata = sarTest, listw = states_sw) %>% 
  as_tibble()

sdmPredictions <- predict(SDM, newdata = sarTest, listw = states_sw) %>% 
  as_tibble()

3.2.2.8 Calculate the \(RMSE\) and \(MAE\) and compare
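As a reminder, for \(n\) test states with observed values \(y_i\) and predictions \(\hat{y}_i\), the two error metrics are:

\[
RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\qquad\qquad
MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
\]

RMSE penalizes large misses more heavily than MAE, which treats all errors linearly.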

## RMSE
olsRMSE <- sqrt(mean((sarTest$bc_PredictedReadmissionRate - olsPredictions)^2))
sarRMSE <- sqrt(mean((sarTest$bc_PredictedReadmissionRate - sarPredictions$fit)^2))
semRMSE <- sqrt(mean((sarTest$bc_PredictedReadmissionRate - semPredictions$fit)^2))
sdmRMSE <- sqrt(mean((sarTest$bc_PredictedReadmissionRate - sdmPredictions$fit)^2))

## MAE
olsMAE <- mean(abs(sarTest$bc_PredictedReadmissionRate - olsPredictions))
sarMAE <- mean(abs(sarTest$bc_PredictedReadmissionRate - sarPredictions$fit))
semMAE <- mean(abs(sarTest$bc_PredictedReadmissionRate - semPredictions$fit))
sdmMAE <- mean(abs(sarTest$bc_PredictedReadmissionRate - sdmPredictions$fit))

## Make a dataframe for comparison of RMSE and MAE values
data.frame(Model = c("OLS", "SAR (Spatial Lag)", "SEM (Spatial Error)", "SDM (Durbin)"),
           RMSE = round(c(olsRMSE, sarRMSE, semRMSE, sdmRMSE), 2),
           MAE = round(c(olsMAE, sarMAE, semMAE, sdmMAE),2)) %>% 
  ## Display with kable
  kable(digits = 2,
        format = "html",
        caption = "Table 13. Comparison of RMSE and MAE for all models.") %>%
  kable_styling(bootstrap_options = c("hover"), full_width = F)
Table 13. Comparison of RMSE and MAE for all models.
Model RMSE MAE
OLS 1.30 0.82
SAR (Spatial Lag) 1.09 0.74
SEM (Spatial Error) 1.09 0.74
SDM (Durbin) 1.09 0.74

QUESTION: What does the comparison of the four models’ \(RMSE\) and \(MAE\) imply? Which model is the superior model? Is SAR still “good enough”? Do you think that, despite adding population density, we’re likely STILL omitting spatial variables? Explain.

The comparison suggests that SAR, SEM, and SDM all outperform the OLS model. However, those three models are equal in performance when comparing RMSE and MAE. Considering what was concluded in 3.2.2.3, I would argue that SAR is in fact still “good enough”. There are no clear advantages seen by either SEM or SDM.

I believe it is probable that there are still omitted spatial variables. Going back to the concept of SDOH, population density is just one of many possible environmental factors that can affect health. Including other SDOHs could potentially increase model performance.

3.3 Testing Another Hypothesis: Readmissions by ?

The time has finally come for you to choose your own adventure again! There are two more hypotheses we could test using spatially influenced social determinants of health (SDOH) that we imported with population density from the U.S. Census.

Question 29: [10 points]

Choose ONE of the two hypotheses to explore and, using the code from Section 3.2, walk through both the heuristic and spatial regression tests of your chosen hypothesis. The hypotheses have been provided for you this time.

  • Hypothesis 2: High Elderly Populations: States with a higher proportion of their adult population over the age of 65 have higher rates of readmission because of age-related factors that increase the likelihood of severe illness, like pneumonia, and increased risk of complications that could lead to higher hospital readmissions.

But these are not the only population-related hypotheses we could pose. Perhaps, instead, it has to do with poverty - perhaps states with higher poverty rates also tend to see lower rates of medical coverage, thereby leading hospitals to release patients too early to spare costs. Thus, we might also ask: are states with higher poverty rates more likely to see higher pneumonia readmissions?

  • Hypothesis 3: High Poverty Populations: Higher rates of pneumonia-related readmissions may be associated with higher proportions of the population living at or below the poverty threshold because those individuals may have poorer medical coverage, including being un- or under-insured, or a stronger drive to try to return to work, both of which could in turn pressure some hospitals to discharge sick patients sooner than they otherwise would.

For full credit, walk through all of the steps from Section 3.2. Make sure to answer the interpretation questions, too!

Hypothesis 2

Graphical Exploration of the Data

## Make a map `p2` of over65PercentPop:
p2 <- states_sf_AggTrain %>% 
# Filter out DC, because it is a city, not a state, and will skew the data.
  filter(ID != "district of columbia") %>%  ## <- Comment this out to see why!
# Use ggplot to create a map
ggplot() +
  # Map state boundaries, fill based on over65PercentPop
  geom_sf(aes(fill = over65PercentPop), color = "grey20", alpha = 1, size = 0.2) +
  # Set a color gradient for the over-65 percentage fill
  scale_fill_gradient(
    name = "% of adult population 65+",
    low = "white", high = skittles[3],
    na.value = "grey90"
  ) +
  # Set a title and caption
  labs(title = "Percent of population over 65 by state",
    caption = "Source: U.S. Census Bureau, 2023") +
  # Set the theme for the map
  theme(
    panel.grid = element_blank(),
    panel.background = element_rect(fill = "white"),
    plot.background = element_rect(fill = "white"),
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    axis.title = element_blank(),
    legend.position = "bottom"
  ) +
  # Define the coordinate system for the map
  coord_sf()

## Make a scatterplot called `p3` with best-fit reference line between 
## over65PercentPop and the pneumonia-related hospital readmissions

p3 <- states_sf_AggTrain %>% 
# Filter out DC, for same reason as above
  filter(ID != "district of columbia") %>%  ## <- Comment this out to see why!
# Use ggplot to create a scatterplot between over65PercentPop and the Median Raw target 
ggplot(aes(x = over65PercentPop, y = Median_RawPredictedReadmissionRate)) +
# Add points for each state, add a regression line with a confidence interval
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm", se = T, color = skittles[3]) +
  theme_minimal() +
  # Set a title and axis labels
  labs(title = "Hospital readmissions by percent of population over 65",
       x = "Percent of adult population over 65",
       y = "Predicted Pneumonia-related Hospital Readmissions")

## Make a barplot `p4` that shows the highest over65PercentPop states with their median
p4 <- states_sf_AggTrain %>% 
# Filter out DC, for same reason as above
  filter(ID != "district of columbia") %>%  ## <- Comment this out to see why!
  # Arrange data in descending order by over65PercentPop
  arrange(desc(over65PercentPop)) %>% 
  # Select the top 10 rows
  head(10) %>% 
  # Use ggplot to create a barplot
ggplot(aes(x = fct_reorder(ID, over65PercentPop), y = Median_RawPredictedReadmissionRate)) +
  # Set fill color to skittles[3], set theme to minimal
  geom_col(fill = skittles[3]) +
  theme_minimal() + 
  # Make barplot horizontal by flipping axes
  coord_flip() +
  # Set a title, axis labels, and caption
  labs(title = "Top 10 States by Percent of Population over 65",
     y = "Median % hospital readmissions",
     x = "",
     caption = "Source: CMS, 2024 & U.S. Census, 2023; n.b. Wash D.C. excluded")

Comparing p1 to p2, the connection between median readmission rate and percent of population over 65 does not appear to be as obvious as the connection between the target and population density.

In addition, the trend in p3 appears much less defined than we saw in the previous hypothesis.

Overall, the relationship does not visually appear to be as strong as the previously tested one.

Adding over65PercentPop to the SAR

## TRAINING
            ## Start from the training data
sarTrain <- states_sf_AggTrain %>% 
  ## Make it a tibble
  as_tibble() %>% 
  ## Drop the features that we dropped before the analysis in Project 1, as 
  ## well as the features we do NOT want to use as predictors!
  select(-Median_RawPredictedReadmissionRate, 
         -Median_RawMedicareSpending, 
         -Median_RawHospitalReturnDays, 
         -Median_RawDeathRate, 
         -ID,
         -localI, 
         -pValue, 
         -quadrant, 
         -geom, 
         -lag_medianReadmissions,
  ## Make sure to remove any correlated variables/variables we don't want to
  ## test from the Census data!!
         -totalPop,
         -adultPop,
         -over65Pop, 
         -percBelowPoverty,
         -landSqMi, 
         -popDensity)

## TESTING
            ## Start from the testing data
sarTest <- states_sf_AggTest %>% 
  ## Make it a tibble
  as_tibble() %>% 
  ## Move the ID column to the rownames
  column_to_rownames("ID") %>% 
  ## Drop the features that we dropped before the analysis in Project 1, as 
  ## well as the features we do NOT want to use as predictors!
  select(-Median_RawPredictedReadmissionRate, 
         -Median_RawMedicareSpending, 
         -Median_RawHospitalReturnDays, 
         -Median_RawDeathRate,
         -geom,
         ## Make sure to remove any correlated variables/variables we don't want to
         ## test from the Census data!!
         -totalPop,
         -adultPop,
         -over65Pop, 
         -percBelowPoverty,
         -landSqMi, 
         -popDensity) %>% 
  ## Have to drop any missing values from the testing data
  drop_na()

Fit the OLS regression and test Moran’s I on the residuals

OLS <- lm(bc_PredictedReadmissionRate ~ ., data = sarTrain)
## Moran's I test on residuals
moran.test(residuals(OLS), states_sw)
## 
##  Moran I test under randomisation
## 
## data:  residuals(OLS)  
## weights: states_sw    
## 
## Moran I statistic standard deviate = 1.5815, p-value = 0.05688
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.132540584      -0.020833333       0.009405215

The Moran’s I statistic is 0.132540584 and the p-value is 0.05688, which is just barely greater than 0.05. Therefore, we cannot reject the null hypothesis at the 0.05 level, though the result is borderline. There does not appear to be significant spatial autocorrelation in the residuals.

Fit the SAR and look at \(\rho\) and the LM test

## SAR (spatial lag model)
SAR <- lagsarlm(bc_PredictedReadmissionRate ~ ., data = sarTrain, 
                listw = states_sw)
## View the results:
summary(SAR)
## 
## Call:lagsarlm(formula = bc_PredictedReadmissionRate ~ ., data = sarTrain, 
##     listw = states_sw)
## 
## Residuals:
##         Min          1Q      Median          3Q         Max 
## -0.64235152 -0.22483126  0.00090686  0.16937740  0.72117364 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                                                                 Estimate
## (Intercept)                                                    0.5121028
## MRSA.Bacteremia                                               -0.0343097
## Abdomen.CT.Use.of.Contrast.Material                           -0.0217936
## Death.rate.for.pneumonia.patients                             -0.0823365
## Postoperative.respiratory.failure.rate                        -0.6055932
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate  0.9462358
## Composite.patient.safety                                      -0.3417309
## Medicare.spending.per.patient                                  0.4773302
## Healthcare.workers.given.influenza.vaccination                 0.0346112
## Left.before.being.seen                                        -0.0030515
## Venous.Thromboembolism.Prophylaxis                            -0.1341350
## Doctor.communication                                           0.3760804
## Communication.about.medicines                                 -0.2087982
## Discharge.information                                         -0.2700228
## Cleanliness                                                   -0.3367435
## Quietness                                                     -0.3288389
## Hospital.return.days.for.pneumonia.patients                    0.1260897
## over65PercentPop                                              -2.2547639
##                                                               Std. Error
## (Intercept)                                                    0.6341337
## MRSA.Bacteremia                                                0.2708418
## Abdomen.CT.Use.of.Contrast.Material                            0.1623636
## Death.rate.for.pneumonia.patients                              0.2017153
## Postoperative.respiratory.failure.rate                         0.3568328
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate  0.2706946
## Composite.patient.safety                                       0.2209597
## Medicare.spending.per.patient                                  0.1465484
## Healthcare.workers.given.influenza.vaccination                 0.1153324
## Left.before.being.seen                                         0.1522869
## Venous.Thromboembolism.Prophylaxis                             0.2305776
## Doctor.communication                                           0.2042370
## Communication.about.medicines                                  0.2467482
## Discharge.information                                          0.1873420
## Cleanliness                                                    0.2230235
## Quietness                                                      0.1335444
## Hospital.return.days.for.pneumonia.patients                    0.1377030
## over65PercentPop                                               2.8224039
##                                                               z value Pr(>|z|)
## (Intercept)                                                    0.8076 0.419342
## MRSA.Bacteremia                                               -0.1267 0.899195
## Abdomen.CT.Use.of.Contrast.Material                           -0.1342 0.893223
## Death.rate.for.pneumonia.patients                             -0.4082 0.683140
## Postoperative.respiratory.failure.rate                        -1.6971 0.089671
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate  3.4956 0.000473
## Composite.patient.safety                                      -1.5466 0.121966
## Medicare.spending.per.patient                                  3.2571 0.001125
## Healthcare.workers.given.influenza.vaccination                 0.3001 0.764101
## Left.before.being.seen                                        -0.0200 0.984013
## Venous.Thromboembolism.Prophylaxis                            -0.5817 0.560745
## Doctor.communication                                           1.8414 0.065564
## Communication.about.medicines                                 -0.8462 0.397441
## Discharge.information                                         -1.4413 0.149490
## Cleanliness                                                   -1.5099 0.131069
## Quietness                                                     -2.4624 0.013801
## Hospital.return.days.for.pneumonia.patients                    0.9157 0.359843
## over65PercentPop                                              -0.7989 0.424360
## 
## Rho: 0.36237, LR test value: 5.3676, p-value: 0.020514
## Asymptotic standard error: 0.13224
##     z-value: 2.7403, p-value: 0.0061381
## Wald statistic: 7.5093, p-value: 0.0061381
## 
## Log likelihood: -12.44762 for lag model
## ML residual variance (sigma squared): 0.09407, (sigma: 0.30671)
## Number of observations: 49 
## Number of parameters estimated: 20 
## AIC: NA (not available for weighted model), (AIC for lm: 68.263)
## LM test for residual autocorrelation
## test value: 0.083488, p-value: 0.77262

The value of rho is 0.36237 and its p-value is 0.020514, which is less than 0.05. This suggests that there is spatial dependence in our target variable. This does not surprise me, as it matches what we saw in the two earlier models. The LM test has a test value of 0.083488 and a p-value of 0.77262, which is greater than 0.05. This suggests that there is no remaining spatial autocorrelation in the residuals. Taken together, these two results suggest that a SAR model is more appropriate than a SEM model: we see signs of spatial spillover in the target but no spatial correlation in the residuals.
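As a quick sanity check on the LM result, we can also run a Moran's I test directly on the SAR residuals. This is a sketch that was not part of the original analysis; it assumes the spdep package is available and that `states_sw` is the same `listw` object used to fit the model.

```r
## Moran's I on the SAR residuals -- a cross-check on the LM test above.
## Assumes spdep is loaded and states_sw is the listw used to fit SAR.
library(spdep)

moran.test(residuals(SAR), listw = states_sw)
## A large p-value here agrees with the LM test: no remaining spatial
## autocorrelation in the residuals.
```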

Compare the SAR to the OLS model with a Likelihood Ratio Test (LRT)

## Give it the larger model first, then the null model.
## SAR is our larger model; the null OLS is nested within the SAR because
## the SAR = OLS when rho = 0!
anova(SAR, OLS)
##     Model df    AIC  logLik Test L.Ratio  p-value
## SAR     1 20 64.895 -12.448    1                 
## OLS     2 19 68.263 -15.131    2  5.3676 0.020514

The SAR model has a lower AIC and a higher log-likelihood than the OLS model, suggesting that it is the better-fitting model.
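The LR statistic in the table can be reproduced by hand from the two log-likelihoods: the test statistic is twice the difference in log-likelihoods, compared against a chi-squared distribution whose degrees of freedom equal the difference in parameter counts (here 1, for rho). A minimal sketch, assuming the fitted `SAR` and `OLS` objects from above:

```r
## LR = 2 * (logLik(larger model) - logLik(nested model))
LRstat <- as.numeric(2 * (logLik(SAR) - logLik(OLS)))
LRstat                                      ## ~5.37, matching the LR test value above
pchisq(LRstat, df = 1, lower.tail = FALSE)  ## ~0.021, matching the reported p-value
```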

Inferential statistics

## Print a table using gtsummary
tbl_regression(SAR) %>% 
  modify_caption("Table 7. Spatial Lag (SAR) Regression results.") %>% 
  modify_header(label = "**Coefficient**") %>% 
  add_significance_stars(hide_p = FALSE)
Table 7. Spatial Lag (SAR) Regression results.
Coefficient Beta1 SE p-value
rho 0.36** 0.132 0.006
(Intercept) 0.51 0.634 0.4
MRSA.Bacteremia -0.03 0.271 0.9
Abdomen.CT.Use.of.Contrast.Material -0.02 0.162 0.9
Death.rate.for.pneumonia.patients -0.08 0.202 0.7
Postoperative.respiratory.failure.rate -0.61 0.357 0.090
Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 0.95*** 0.271 <0.001
Composite.patient.safety -0.34 0.221 0.12
Medicare.spending.per.patient 0.48** 0.147 0.001
Healthcare.workers.given.influenza.vaccination 0.03 0.115 0.8
Left.before.being.seen 0.00 0.152 >0.9
Venous.Thromboembolism.Prophylaxis -0.13 0.231 0.6
Doctor.communication 0.38 0.204 0.066
Communication.about.medicines -0.21 0.247 0.4
Discharge.information -0.27 0.187 0.15
Cleanliness -0.34 0.223 0.13
Quietness -0.33* 0.134 0.014
Hospital.return.days.for.pneumonia.patients 0.13 0.138 0.4
over65PercentPop -2.3 2.82 0.4
Abbreviations: CI = Confidence Interval, SE = Standard Error
1 *p<0.05; **p<0.01; ***p<0.001

The significant predictors in this model are:

- Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate
- Medicare.spending.per.patient
- Quietness

This lines up with our other SAR models. However, we see that over65PercentPop is not significant, with a p-value of 0.4.

The point estimate suggests that a one-unit increase in over65PercentPop is associated with a 2.3-unit decrease in readmission rate (on the scale of our transformed data), but given the p-value of 0.4 this effect cannot be distinguished from zero.
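One caveat on interpreting these betas: in a spatial lag model, a coefficient is not the full marginal effect of its predictor, because a change in one state's predictor also feeds back through the rho * W term to its neighbors. The `impacts()` function in spatialreg decomposes each effect into direct, indirect (spillover), and total components. A minimal sketch, assuming spatialreg is loaded and `states_sw` is the `listw` used to fit SAR:

```r
## Decompose SAR effects into direct, indirect (spillover), and total impacts.
## Assumes spatialreg is loaded and states_sw is the weights list used above.
library(spatialreg)

impacts(SAR, listw = states_sw)
```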

Fit SEM and SDM

## Fit the SEM (spatial error model) using the training data and weights matrix
SEM <- errorsarlm(bc_PredictedReadmissionRate ~ ., data = sarTrain, 
                  listw = states_sw)
## Uncomment to view results!
## summary(SEM)

## Fit the SDM (spatial Durbin model) using the training data and weights matrix
SDM <- lagsarlm(formula = bc_PredictedReadmissionRate ~ ., data = sarTrain, 
    listw = states_sw, type = "mixed") ## "mixed" is necessary to specify as SDM
## Uncomment to view results!
## summary(SDM)

Predictions using the testing data on SAR, SEM, and SDM

## Generate predictions for the FOUR models using the testing data!
olsPredictions <- predict(OLS, newdata = sarTest)

sarPredictions <- predict(SAR, newdata = sarTest, listw = states_sw) %>% 
  as_tibble()

semPredictions <- predict(SEM, newdata = sarTest, listw = states_sw) %>% 
  as_tibble()

sdmPredictions <- predict(SDM, newdata = sarTest, listw = states_sw) %>% 
  as_tibble()

Calculate the \(RMSE\) and \(MAE\) and compare

## RMSE
olsRMSE <- sqrt(mean((sarTest$bc_PredictedReadmissionRate - olsPredictions)^2))
sarRMSE <- sqrt(mean((sarTest$bc_PredictedReadmissionRate - sarPredictions$fit)^2))
semRMSE <- sqrt(mean((sarTest$bc_PredictedReadmissionRate - semPredictions$fit)^2))
sdmRMSE <- sqrt(mean((sarTest$bc_PredictedReadmissionRate - sdmPredictions$fit)^2))

## MAE
olsMAE <- mean(abs(sarTest$bc_PredictedReadmissionRate - olsPredictions))
sarMAE <- mean(abs(sarTest$bc_PredictedReadmissionRate - sarPredictions$fit))
semMAE <- mean(abs(sarTest$bc_PredictedReadmissionRate - semPredictions$fit))
sdmMAE <- mean(abs(sarTest$bc_PredictedReadmissionRate - sdmPredictions$fit))

## Make a dataframe for comparison of RMSE and MAE values
data.frame(Model = c("OLS", "SAR (Spatial Lag)", "SEM (Spatial Error)", "SDM (Durbin)"),
           RMSE = round(c(olsRMSE, sarRMSE, semRMSE, sdmRMSE), 2),
           MAE = round(c(olsMAE, sarMAE, semMAE, sdmMAE),2)) %>% 
  ## Display with kable
  kable(digits = 2,
        format = "html",
        caption = "Table 13. Comparison of RMSE and MAE for all models.") %>%
  kable_styling(bootstrap_options = "hover", full_width = FALSE)
Table 13. Comparison of RMSE and MAE for all models.
Model RMSE MAE
OLS 1.33 0.85
SAR (Spatial Lag) 1.13 0.79
SEM (Spatial Error) 1.13 0.79
SDM (Durbin) 1.13 0.79

The comparison suggests that SAR, SEM, and SDM all outperform the OLS model. However, the three spatial models are indistinguishable from one another on both RMSE and MAE.

However, all of these models perform worse than the model built using population density. Given that over65PercentPop was not a significant predictor, this is not surprising. Thus, while we cannot rule out other spatial factors unaccounted for by our population density model, it does not appear that over65PercentPop is one of those factors.